How does the Spark Cassandra Connector determine the range to query on Cassandra? - apache-spark

I have a three-node Cassandra cluster with a Spark executor running on each node. I understand that to scan the Cassandra database, the SCC (Spark Cassandra Connector) uses range queries, putting tokens in the WHERE clause. How is an SCC instance running on one node able to select ranges different from those used by the SCC instances running on the other nodes? For example, if SCC instance A on node1 picks a range RangeA, how do SCC instances B and C decide not to use the same range RangeA?
Do they communicate with each other?

When executing an action, the driver generates the list of partitions that will then be mapped onto Spark partitions and distributed between the worker nodes. The generation of partitions depends on multiple factors (you can look into the ScanHelper.getPartitionGenerator function):
does the WHERE condition contain the partition key or not
is partition count already specified or not
Based on that, it returns an instance of the CassandraPartitionGenerator class, which performs the actual generation of partitions in its partitions function: it fetches the list of token ranges from the cluster, splits these token ranges into smaller token ranges if necessary, groups them by the nodes they belong to, etc.
That instance of CassandraPartitionGenerator is then used by either the DataFrame or the RDD API to get the list of Spark partitions that will be scheduled for execution by Spark. In the end, these partitions are converted into CQL WHERE clauses by the CqlTokenRange class.
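To make this concrete, here is a minimal sketch (Scala, using the connector's RDD API; keyspace and table names are placeholders) of what the result looks like from the user's side: each Spark partition produced by CassandraPartitionGenerator covers a group of token ranges, and each of those ranges is eventually executed as a token-bounded CQL query.

import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

object TokenRangeScan {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("scc-token-ranges")
      .config("spark.cassandra.connection.host", "127.0.0.1") // any contact point
      .getOrCreate()

    // cassandraTable returns a CassandraTableScanRDD; its Spark partitions are
    // the grouped token ranges produced by CassandraPartitionGenerator.
    val rdd = spark.sparkContext.cassandraTable("my_keyspace", "my_table")
    println(s"Spark partitions generated: ${rdd.getNumPartitions}")

    // Each partition is executed as one or more CQL queries roughly of the form:
    //   SELECT ... FROM my_keyspace.my_table
    //   WHERE token(partition_key) > ? AND token(partition_key) <= ?
    // with the bounds supplied by CqlTokenRange.
    spark.stop()
  }
}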
P.S. Russell Spitzer also wrote a blog post on Spark data locality & the Spark Cassandra Connector - this could also be useful for understanding.

Spark-cassandra-connector basics
The spark-cassandra-connector has fairly complicated internals, but the most important things (overly simplified) are the following:
the connector naturally prefers to query locally, e.g., to avoid network traffic and to have each Spark executor query its local Cassandra node
to do that, the driver needs to understand the Cassandra topology and where the token ranges you need to query reside (there is an initial ring describe done by the driver, so after that there is a full understanding of where to find each part of your token range)
after understanding where the token ranges are, and mapping each token range to an IP, the connector spreads the work so that each local Spark executor queries the part of the range that is local to it (see the sketch right after this list)
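As a small sketch (assuming the connector's RDD API; keyspace, table and host names are placeholders), you can observe that mapping yourself by printing the preferred locations Spark records for each partition - they are the Cassandra replicas that own the corresponding token ranges:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("locality-check")
  .set("spark.cassandra.connection.host", "10.0.0.1") // any contact point
val sc = new SparkContext(conf)

val rdd = sc.cassandraTable("my_keyspace", "my_table")

// Each Spark partition maps to a group of token ranges; its preferred locations
// are the nodes that own those ranges, which is what drives local execution.
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}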
More detailed information
It's a bit more complex than that, but that's it in a nutshell. I think this video from Datastax explains it a bit better.
You might also want to consider reading this question (with, admittedly, a vague answer).
How you structure your data is important for this to work out of the box
Note that there is a bit of skill/knowledge required to structure your data and your query in such a way that the driver can try to do that.
Actually, the most common type of performance problem usually stems from badly structured data or queries leading to non-local execution. The DataStax Java driver and the spark-cassandra-connector internally try their best to make the queries local, but you also need to follow the best practices for structuring your data. If you haven't already done so, I recommend reading/going through the trainings described in the Data Modeling By Example articles by DataStax.
Edit: queries without locality
As you mentioned, sometimes the executors don't reside on the same hosts as the Cassandra nodes. Still, the principle is the same:
When you have a query, it is over a certain token range. Some of the data for this query will be "owned" by node A, some of the data will be "owned" by node B, and some by node C.
The ring describe operation tells the driver, for a certain range, which part of it is in node A, which in node B, and which in node C. The driver then essentially splits the query into 3 subqueries and asks the appropriate nodes, each of which owns a particular range, for them.
Each node responds with its own portion, and at the end the driver aggregates them.
You might notice that local or not, the principle is exactly the same:
ask each node only about the particular range it owns, which the driver learned earlier by using the ring describe operation.
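Purely as an illustration (this is not connector or driver code, and the token values are made up), the split could be pictured like this:

// Hypothetical result of a ring describe for a 3-node cluster.
case class TokenRange(start: Long, end: Long, owner: String)

val ring = Seq(
  TokenRange(Long.MinValue,          -3074457345618258603L, "nodeA"),
  TokenRange(-3074457345618258603L,   3074457345618258602L, "nodeB"),
  TokenRange(3074457345618258602L,    Long.MaxValue,        "nodeC")
)

// One logical full scan becomes one subquery per owned range, each sent to
// (or preferred on) the node that owns it; the results are then aggregated.
val subQueries = ring.map { r =>
  r.owner -> s"SELECT * FROM ks.tbl WHERE token(pk) > ${r.start} AND token(pk) <= ${r.end}"
}
subQueries.foreach { case (node, cql) => println(s"$node -> $cql") }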
Hope that makes it a bit clearer.

Related

How are the task results being processed on Spark?

I am new to Spark and I am currently trying to understand the architecture of Spark.
As far as I know, the spark cluster manager assigns tasks to worker nodes and sends them partitions of the data. Once there, each worker node performs the transformations (like mapping etc.) on its own specific partition of the data.
What I don't understand is: where do all the results of these transformations from the various workers go? Are they sent back to the cluster manager / driver and reduced there (e.g. a sum of the values for each unique key)? If yes, is there a specific way this happens?
It would be nice if someone could enlighten me; neither the Spark docs nor other resources concerning the architecture have been able to do so.
Good question. I think you are asking how a shuffle works...
Here is a good explanation.
When does shuffling occur in Apache Spark?
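As a hedged sketch of the flow (standard Spark APIs, made-up data): transformations like reduceByKey shuffle intermediate results between executors without going through the driver; results only come back to the driver when you call an action such as collect:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-demo").getOrCreate()
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// Shuffle: rows with the same key are exchanged between executors and reduced there.
val summed = pairs.reduceByKey(_ + _)

// Action: only now are the already-reduced results sent back to the driver.
summed.collect().foreach(println) // (a,4), (b,6)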

Batch processing job (Spark) with lookup table that's too big to fit into memory

I'm trying to write a batch job to process a couple hundred terabytes that currently sit in an HBase database (in an EMR cluster in AWS), all in a single large table. For every row I'm processing, I need to get additional data from a lookup table (a simple integer-to-string mapping) that is in a second HBase table. We'd be doing 5-10 lookups per row.
My current implementation uses a Spark job that distributes partitions of the input table to its workers, in the following shape:
Configuration hBaseConfig = newHBaseConfig();
hBaseConfig.set(TableInputFormat.SCAN, convertScanToString(scan));
hBaseConfig.set(TableInputFormat.INPUT_TABLE, tableName);

// Load the HBase table as an RDD of (row key, Result) pairs
JavaPairRDD<ImmutableBytesWritable, Result> table =
    sparkContext.newAPIHadoopRDD(hBaseConfig, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);

table.map(val -> {
    // some preprocessing
}).foreachPartition(p -> {
    p.forEachRemaining(row -> {
        // code that does the lookup
    });
});
The problem is that the lookup table is too big to fit in the workers' memory. They all need access to all parts of the lookup table, but their access pattern would significantly benefit from a cache.
Am I right in thinking that I cannot use a simple map as a broadcast variable because it'd need to fit into memory?
Spark uses a shared nothing architecture, so I imagine there won't be an easy way to share a cache across all workers, but can we build a simple LRU cache for every individual worker?
How would I implement such a local worker cache that gets the data from the lookup table in HBase on a cache miss? Can I somehow distribute a reference to the second table to all workers?
I'm not set on my choice of technology, apart from HBase as the data source. Is there a framework other than Spark which could be a better fit for my use case?
You have a few options for dealing with this requirement:
1- Use RDD or Dataset joins
You can load both of your HBase tables as Spark RDDs or Datasets and then join on your lookup key.
Spark will split both RDDs into partitions and shuffle content around so that rows with the same keys end up on the same executors.
By managing the number of partitions within Spark, you should be able to join two tables of arbitrary size.
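A minimal sketch of option 1 (assuming you have already mapped each HBase table to key/value pairs; the loading code is elided):

import org.apache.spark.rdd.RDD

// (lookupKey, mainRowPayload) joined with (lookupKey, lookupValue):
// Spark hash-partitions both sides on the key and shuffles matching keys to
// the same executor, so neither side needs to fit in memory on one node.
def joinWithLookup(mainRows: RDD[(Int, String)],
                   lookup: RDD[(Int, String)]): RDD[(Int, (String, String))] =
  mainRows.join(lookup, 2000) // tune the partition count to your data size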
2- Broadcast a resolver instance
Instead of broadcasting a map, you can broadcast a resolver instance that does an HBase lookup backed by a temporary LRU cache. Each executor will get a copy of this instance and can manage its own cache, and you can invoke it within your foreachPartition() code.
Beware: the resolver instance needs to implement Serializable, so you will have to declare the cache, HBase connection and HBase Configuration properties as transient so they are initialized on each executor.
I run such a setup in Scala on one of the projects I maintain: it works and can be more efficient than a straight Spark join if you know your access patterns and manage your cache efficiently.
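Here is a sketch of what such a resolver could look like (Scala; the table, column family and qualifier names are placeholders, and the LRU cache is a simple access-ordered LinkedHashMap). The @transient lazy fields are never serialized; each executor builds its own connection and cache the first time it calls resolve():

import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get, Table}
import org.apache.hadoop.hbase.util.Bytes

class LookupResolver(maxCacheSize: Int = 100000) extends Serializable {

  // Recreated lazily on each executor, never shipped from the driver.
  @transient private lazy val connection: Connection =
    ConnectionFactory.createConnection(HBaseConfiguration.create())

  @transient private lazy val table: Table =
    connection.getTable(TableName.valueOf("lookup_table"))

  @transient private lazy val cache: JMap[Int, String] =
    new JLinkedHashMap[Int, String](maxCacheSize, 0.75f, true) {
      override def removeEldestEntry(eldest: JMap.Entry[Int, String]): Boolean =
        size() > maxCacheSize // evict the least recently used entry
    }

  def resolve(key: Int): String = {
    val cached = cache.get(key)
    if (cached != null) cached
    else {
      val result = table.get(new Get(Bytes.toBytes(key)))
      val value = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
      cache.put(key, value)
      value
    }
  }
}

You can then broadcast one instance, or simply reference it from the foreachPartition closure; either way each executor ends up with its own connection and cache.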
3- Use the HBase Spark connector to implement your lookup logic
Apache HBase has recently incorporated improved HBase Spark connectors.
The documentation is pretty sparse right now; you need to look at the JIRA tickets and the documentation of the previous incarnation of these tools, Cloudera's SparkOnHBase, but the last unit test in the test suite looks pretty much like what you want.
I have no experience with this API though.

"total-executor-cores" parameter in Spark in relation to Data Nodes

Another item that I read little about.
Leaving S3 aside, and not being in a position just now to try out a bare-metal, classic data-locality approach with Spark and Hadoop, and not using Dynamic Resource Allocation mode:
What if a large dataset in HDFS is distributed over (all) N data nodes in the cluster, but the total-executor-cores parameter is set lower than N, and we obviously need to read all the data on (all) N relevant data nodes?
I assume Spark has to ignore this parameter for reading from HDFS. Or not?
If it is ignored, does an executor core need to be allocated on that data node, and is it thus acquired by the overall job? In other words, should this parameter be interpreted as applying to processing and not to reading blocks?
Is the data from such a Data Node immediately shuffled to where the Executors were allocated?
Thanks in advance.
There seems to be a little bit of confusion here.
Optimal data locality (node-local) is something we want to achieve, not something that is guaranteed. All Spark can do is request resources (for example with YARN - see How YARN knows data locality in Apache Spark in cluster mode) and hope that it gets resources which satisfy its data locality constraints.
If it doesn't, it will simply fetch the data from remote nodes. However, this is not a shuffle; it is just a simple transfer over the network.
So to answer your question: Spark will use the resources that have been allocated, doing its best to satisfy the locality constraints. It cannot use nodes that haven't been acquired, so it won't automatically get additional nodes for reads.
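A hedged sketch of the knobs involved (standalone mode, where --total-executor-cores corresponds to spark.cores.max; spark.locality.wait controls how long the scheduler waits for a node-local slot before falling back to a remote read):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("locality-example")
  .config("spark.cores.max", "8")      // fewer total cores than there are data nodes
  .config("spark.locality.wait", "3s") // default; raise it to wait longer for locality
  .getOrCreate()

// The read still succeeds: blocks with no local executor are simply fetched over
// the network by whichever executor runs the task - a plain transfer, not a shuffle.
val df = spark.read.text("hdfs:///data/large-dataset")
println(df.rdd.getNumPartitions)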

Spark Ingestion path: "Source to Driver to Worker" or "Source to Workers"

When Spark ingests data, are there specific situations where it has to go through the driver and then from the driver to the workers? The same question applies to a direct read by the workers.
I guess I am simply trying to map out the conditions or situations that lead to one way or the other, and how partitioning happens in each case.
If you limit yourself to built-in methods, then unless you create a distributed data structure from a local one with a method like:
SparkSession.createDataset
SparkContext.parallelize
data is always accessed directly by the workers, but the details of the data distribution will vary from source to source.
RDDs typically depend on Hadoop input formats, but Spark SQL and the data source API are at least partially independent, at least when it comes to configuration.
It doesn't mean data is always properly distributed. In some cases (JDBC, streaming receivers) data may still be piped through a single node.
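A minimal sketch of the two paths described above (the file path is a placeholder):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ingestion-paths").getOrCreate()
import spark.implicits._

// Path 1: a local collection already lives in the driver, so the data is
// distributed from the driver to the executors.
val localSeq = Seq(1, 2, 3, 4, 5)
val fromDriverRdd = spark.sparkContext.parallelize(localSeq)
val fromDriverDs  = spark.createDataset(localSeq)

// Path 2: an external source is read directly by the executors; the driver
// only plans the read and assigns the input splits/partitions.
val fromSource = spark.read.parquet("hdfs:///warehouse/events")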

Apache Spark node asking master for more data?

I'm trying to benchmark a few approaches to putting an image processing algorithm into apache spark. For one step in this algorithm, a computation on a pixel in the image will depend on an unknown amount of surrounding data, so we can't partition the image with guaranteed sufficient overlap a priori.
One solution to that problem I need to benchmark is for a worker node to ask the master node for more data when it encounters a pixel with insufficient surrounding data. I'm not convinced this is the way to do things, but I need to benchmark it anyway because of reasons.
Unfortunately, after a bunch of googling and reading docs I can't find any way for a processingFunc called as part of sc.parallelize(partitions).map(processingFunc) to query the master node for more data from a different partition mid-computation.
Does a way for a worker node to ask the master for more data exist in spark, or will I need to hack something together that kind of goes around spark?
The master node in Spark is responsible for allocating resources to a particular job; once the resources are allocated, the driver ships the complete code with all its dependencies to the various executors.
The first step in every job is to load the data into the Spark cluster. You can read the data from any underlying data repository such as a database, filesystem, web services, etc.
Once the data is loaded, it is wrapped into an RDD, which is partitioned across the nodes in the cluster and stored in the workers'/executors' memory. You can control the number of partitions by leveraging various RDD APIs, but you should do so only when you have valid reasons to.
All operations are then performed over RDDs using the various methods/operations exposed by the RDD API. An RDD keeps track of its partitions and the partitioned data, and depending on the need or request it automatically queries the appropriate partition.
In a nutshell, you do not have to worry about the way data is partitioned by an RDD, which partition stores which data, or how they communicate with each other; but if you do care, you can write your own custom partitioner, instructing Spark how to partition your data (see the sketch at the end of this answer).
Secondly, if your data cannot be partitioned at all, then I do not think Spark would be an ideal choice, because that would result in everything being processed on a single machine, which is contrary to the idea of distributed computing.
I'm not sure what exactly your use case is, but there are people who have been leveraging Spark for image processing; see here for the comments from Databricks.
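For the custom-partitioner point above, here is a purely illustrative sketch (the tile-keying scheme is made up): keys are (tileX, tileY) coordinates, and the partitioner keeps a whole row of tiles in one partition so that neighbouring tiles are more likely to be colocated:

import org.apache.spark.Partitioner

class TileRowPartitioner(numRows: Int) extends Partitioner {
  override def numPartitions: Int = numRows

  override def getPartition(key: Any): Int = key match {
    case (_: Int, tileY: Int) => math.abs(tileY) % numRows // keep a tile row together
    case _                    => 0
  }
}

// Usage on a pair RDD keyed by (tileX, tileY):
//   tiles.partitionBy(new TileRowPartitioner(16))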
