How to distribute specific data to each cluster node in Spark?

I am deploying my program on a Spark cluster and I need to give each node a specific list of data that I decide on. How can I do this? I created an RDD object out of my data, but I don't know how to pass a specific part of the data to each node.

I don't think you can pass a specific list to a particular node. If your data has unique keys, then you can use a hash-based partitioning technique to send the same keys to a specific partition, as sketched below.
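If you go the key-based route, a custom partitioner lets you decide which partition each key lands in. A minimal sketch (KeyRoutingPartitioner, the routes map, the sensor keys, and the keyed data are all made up for illustration; note this controls partitions, not physical nodes):

    import org.apache.spark.Partitioner

    // Hypothetical partitioner: route known keys to partitions you choose,
    // and fall back to hashing for anything else.
    class KeyRoutingPartitioner(override val numPartitions: Int,
                                routes: Map[String, Int]) extends Partitioner {
      def getPartition(key: Any): Int =
        routes.getOrElse(key.toString, Math.floorMod(key.hashCode, numPartitions))
    }

    // hypothetical keyed data: (key, payload) pairs
    val keyed  = sc.parallelize(Seq(("sensor-a", "data1"), ("sensor-b", "data2"), ("sensor-c", "data3")))
    val routed = keyed.partitionBy(new KeyRoutingPartitioner(numPartitions = 4,
                                                             routes = Map("sensor-a" -> 0, "sensor-b" -> 1)))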

Not possible, as you have no control over which worker nodes are allocated, and N executors may end up on the same worker node.

Related

How does the Spark-Cassandra-Connector determine the range to query on Cassandra?

I have a three-node Cassandra cluster with a Spark executor running on each node. I understand that to scan the Cassandra database, the SCC (Spark-Cassandra-Connector) uses range queries, putting tokens in the WHERE clause. How is an SCC instance running on one node able to select ranges different from those of SCC instances running on other nodes? For example, if an SCC instance A on node1 picks a range RangeA, how do SCC instances B and C decide not to use the same range RangeA?
Do they communicate with each other?
The driver, when executing the action, generates the list of partitions that will then be mapped into Spark partitions and distributed between the worker nodes. The generation of partitions really depends on multiple factors (you can look into the ScanHelper.getPartitionGenerator function):
does the WHERE condition contain the partition key or not
is partition count already specified or not
Based on that, it returns an instance of the CassandraPartitionGenerator class, which performs the actual generation of partitions with its partitions function: it fetches the list of token ranges from the cluster, splits these token ranges into smaller ones if necessary, groups them by the nodes they belong to, etc.
That instance of CassandraPartitionGenerator is then used by either the DataFrame or the RDD API to get the list of Spark partitions that will be scheduled for execution by Spark. At the end, these partitions are converted into CQL WHERE clauses by the CqlTokenRange class.
P.S. Russell Spitzer also wrote a blog post on Spark data locality & the Spark Cassandra Connector - it could also be useful for understanding this.
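If you want to see from the driver side how the generated partitions map onto hosts, you can inspect each Spark partition's preferred locations. A minimal sketch, assuming the connector is on the classpath; my_keyspace and my_table are placeholders:

    import com.datastax.spark.connector._      // adds cassandraTable to SparkContext

    val cassandraRdd = sc.cassandraTable("my_keyspace", "my_table")

    // each Spark partition reports the hosts it would prefer to run on
    cassandraRdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> preferred hosts: ${cassandraRdd.preferredLocations(p).mkString(", ")}")
    }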
Spark-cassandra-connector basics
The spark-cassandra-connector has fairly complicated internals, but the most important things (overly simplified) are the following:
the connector naturally prefers to query locally, e.g. to avoid the network and to have each Spark executor query its local Cassandra node
to do that, the driver needs to understand the Cassandra topology and where the token ranges you need to query live (there is an initial ring describe done by the driver, so after that it has a full picture of where to find each part of your token range)
after understanding where the token ranges are, and mapping each range to an IP, the connector spreads the work in such a way that each Spark executor queries the part of the range that is local to it
More detailed information
It's a bit more complex than that, but that's it in a nutshell. I think this video from Datastax explains it a bit better.
You might also want to consider reading this question (with, admittedly, a vague answer).
How you structure your data is important for this to work out of the box
Note that there is a bit of skill/knowledge required to structure your data and your query in such a way that the driver can try to do that.
Actually, the most common performance problems usually stem from badly structured data or queries leading to non-local execution. The DataStax Java driver and the spark-cassandra-connector internally try their best to make queries local, but you also need to follow the best practices for structuring your data. If you haven't already done so, I recommend reading/going through the trainings described in the Data Modeling By Example articles by DataStax.
Edit: queries without locality
As you mentioned, sometimes the executors don't reside on the same host as the nodes. Still, the principle is the same:
When you have a query, it is over a certain token range. Some of the data for this query will be "owned" by node A, some of the data will be "owned" by node B, and some by node C.
The ring describe operation tells the driver, for a certain range, which part of it is on node A, which on node B, and which on node C. The driver then essentially splits the query into 3 subqueries and asks each of the appropriate nodes for the particular range it owns.
Each node responds with their own portion, and at the end the driver aggregates it.
You might notice that local or not, the principle is exactly the same:
ask each node only about the particular range it owns, which the driver learned earlier by using the ring describe operation.
Hope that makes it a bit clearer.

Determining the distribution of data within the cluster in Spark

I want to examine the distribution of my data within the cluster. I know how to find out what data is inside each partition. However, I haven't figured out how to find out how that data is distributed across the nodes of the cluster.
Does a method exist in Spark to find out which rows or how many rows of a data frame are on a particular node within the cluster?
Or alternatively, is there a method to map from the partition ID to the executor ID?
Kind regards
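One rough way to inspect this is to tag each partition with the executor that processes it. A minimal sketch, assuming a DataFrame named df; the hostname and SparkEnv.get.executorId (a developer API) identify the node and executor:

    import org.apache.spark.SparkEnv

    // for every partition, record where it ran and how many rows it held
    val layout = df.rdd
      .mapPartitionsWithIndex { (partitionId, rows) =>
        val host       = java.net.InetAddress.getLocalHost.getHostName
        val executorId = SparkEnv.get.executorId
        Iterator((partitionId, executorId, host, rows.size))
      }
      .collect()

    layout.foreach { case (partitionId, executorId, host, rowCount) =>
      println(s"partition $partitionId -> executor $executorId on $host ($rowCount rows)")
    }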

Run a function on all the Spark executors to get a licence key

Before beginning to scan a database, all the executors need a licence file. I have a function licenceFileInstaller.install() that downloads the file to the right location on the system; it works fine on the master node. How do I run this on all the executors?
I guess when you use this method in RDD transformations, Spark should serialize it and send it across all the nodes. For example, you can use it within mapPartitions, and it will be distributed across all the nodes and run once for every partition:
    // runs once per partition, on the executor that owns it
    rdd.mapPartitions { rows => licenceFileInstaller.install(); rows /* ...further database code */ }
Also, I guess you can distribute that licence file using sc.addFile(), which ships it to all the nodes.
You can refer to Spark closures for further understanding
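A minimal sketch of the sc.addFile() route; the file path is a placeholder, and the install(path) variant of the installer is an assumption rather than something from the question:

    import org.apache.spark.SparkFiles

    // on the driver: ship the licence file to every executor's working directory
    sc.addFile("/path/on/driver/licence.key")             // hypothetical path

    rdd.mapPartitions { rows =>
      // on the executor: resolve the local copy that Spark downloaded
      val localLicencePath = SparkFiles.get("licence.key")
      // hypothetical installer variant that works from a local path instead of downloading
      licenceFileInstaller.install(localLicencePath)
      rows
    }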

In Spark Streaming, can I create an RDD on a worker?

I want to know how I can create an RDD on a worker, say containing a Map. This Map/RDD will be small and I want it to reside completely on one machine/executor (I guess repartition(1) can achieve this). Further, I want to be able to cache this Map/RDD on the local executor and use it in tasks running on that executor for lookups.
How can I do this?
No, you cannot create an RDD on a worker node. Only the driver can create RDDs.
A broadcast variable seems to be the solution in your situation. It will send the data to all workers, but if your map is small, that shouldn't be an issue.
You cannot control which node your RDD's single partition will be placed on, so you cannot just do repartition(1) - you don't know whether that partition will end up on the node you want ;) A broadcast variable will be on every node, so lookups will be very fast.
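A minimal sketch of the broadcast approach (the map contents and someKeysRdd, an RDD of keys, are placeholders):

    // built once on the driver
    val lookup   = Map("a" -> 1, "b" -> 2)
    val bcLookup = sc.broadcast(lookup)

    // every executor caches its own copy of the map, so lookups never leave the node
    val enriched = someKeysRdd.map(key => (key, bcLookup.value.getOrElse(key, -1)))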
You can create an RDD in your driver program using sc.parallelize(data). To store a Map, it can be split into keys and values and stored in an RDD/DataFrame as two separate columns.
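For example (a sketch, assuming a SparkSession named spark alongside sc):

    import spark.implicits._                    // needed for toDF

    val data  = Map("a" -> 1, "b" -> 2)         // hypothetical map
    val asRdd = sc.parallelize(data.toSeq)      // RDD[(String, Int)] of (key, value) pairs
    val asDf  = asRdd.toDF("key", "value")      // DataFrame with columns "key" and "value"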

Apache Spark node asking master for more data?

I'm trying to benchmark a few approaches to putting an image processing algorithm into apache spark. For one step in this algorithm, a computation on a pixel in the image will depend on an unknown amount of surrounding data, so we can't partition the image with guaranteed sufficient overlap a priori.
One solution to that problem I need to benchmark is for a worker node to ask the master node for more data when it encounters a pixel with insufficient surrounding data. I'm not convinced this is the way to do things, but I need to benchmark it anyway because of reasons.
Unfortunately, after a bunch of googling and reading docs I can't find any way for a processingFunc called as part of sc.parallelize(partitions).map(processingFunc) to query the master node for more data from a different partition mid-computation.
Does a way for a worker node to ask the master for more data exist in spark, or will I need to hack something together that kind of goes around spark?
The master node in Spark is for allocating resources to a particular job; once the resources are allocated, the driver ships the complete code with all its dependencies to the various executors.
The first step in every job is to load the data into the Spark cluster. You can read the data from any underlying data repository such as a database, filesystem, web services, etc.
Once the data is loaded, it is wrapped into an RDD, which is partitioned across the nodes in the cluster and stored in the workers'/executors' memory. You can control the number of partitions by leveraging various RDD APIs, but you should do so only when you have valid reasons to.
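For example (a sketch; the path and partition counts are placeholders):

    // ask for a minimum number of input partitions when loading
    val pixels = sc.textFile("hdfs:///images/encoded_pixels.txt", minPartitions = 8)

    // ...or change the partition count later, at the cost of a shuffle
    val repartitioned = pixels.repartition(16)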
All operations are then performed on RDDs using the various methods/operations exposed by the RDD API. An RDD keeps track of its partitions and the data in them and, depending on the need or request, automatically queries the appropriate partition.
In a nutshell, you do not have to worry about how an RDD partitions your data, which partition stores which data, or how partitions communicate with each other - but if you do care, you can write your own custom partitioner, instructing Spark how to partition your data.
Secondly, if your data cannot be partitioned, then I do not think Spark would be an ideal choice, because that would result in everything being processed on a single machine, which is contrary to the idea of distributed computing.
Not sure what exactly your use case is, but there are people who have been leveraging Spark for image processing; see here for the comments from Databricks.
