I have a Cassandra cluster with N nodes on N machines, and a Spark worker on every machine. For reading from Cassandra I'm using the DataStax spark-cassandra-connector. When I start the workers (standalone mode) I only give them the master host. In the driver I specify the Cassandra seeds with the spark.cassandra.connection.host property. I have seen many presentations about data locality, but I found no info about how the spark-cassandra-connector selects the local node for each worker. Which algorithm does the connector use for this?
The connector is token-aware. It ensures data locality by adding token range filtering such as token("partition-key") > ? AND token("partition-key") <= ? to the query you run.
The connector uses the parameter spark.cassandra.input.split.size_in_mb (64 MB by default) to generate queries (token(...) > ? AND token(...) <= ?) that will each load about 64 MB of data into a Spark partition.
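For illustration, here is a minimal sketch of wiring those settings up; the host, keyspace and table names are placeholders, not something from the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Hypothetical contact point, keyspace and table names -- adjust for your cluster.
val conf = new SparkConf()
  .setAppName("locality-demo")
  .set("spark.cassandra.connection.host", "10.0.0.1")   // any Cassandra seed node
  .set("spark.cassandra.input.split.size_in_mb", "64")  // approximate data volume per Spark partition
val sc = new SparkContext(conf)

// Each resulting Spark partition is backed by token-range queries of the form
//   SELECT ... FROM ks.tbl WHERE token(pk) > ? AND token(pk) <= ?
val rdd = sc.cassandraTable("ks", "tbl")
println(rdd.getNumPartitions)  // roughly (table size in MB) / 64
```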
Related
Is the DataStax Cassandra community edition integration with the Spark community edition, using the spark-cassandra-connector community edition, node aware, or is this feature reserved for the Enterprise editions only?
By node awareness I mean whether Spark will send task execution to the nodes that own the data.
Yes, the Spark connector is node-aware and will function in that manner with both DSE and (open source) Apache Cassandra.
In fact on a SELECT it knows how to hash the partition keys to a token, and send queries on specific token ranges only to the nodes responsible for that data. It can do this, because (like the Cassandra Java driver) it has a window into node-to-node gossip and can see things like node status (up/down) and token range assignment.
In Spark it is referred to as data locality.
Data locality can only be achieved if both the JVMs for Cassandra and the Spark worker/executor are co-located within the same operating system instance (OSI). By definition, data can only be local if the executors doing the processing are running on the same server (OSI) as the Cassandra node.
During the initial contact with the cluster, the driver retrieves information about the cluster topology -- available nodes, rack/DC configuration, token ownership. Since the driver is aware of where nodes are located, it will always try to connect to the "closest" node in the same (local) data centre.
If the Spark workers/executors are co-located with the Cassandra node, the Spark-Cassandra-connector will process Spark partitions on the nodes that own the data where possible to reduce the amount of data shuffling across the network.
There are methods such as joinWithCassandraTable() which maximise data locality where possible. Additionally, the repartitionByCassandraReplica() method splits the Spark partitions so they are mapped to Cassandra replicas that own the data.
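As a rough sketch of that pattern (the keyspace ks, table users and the UserKey case class are made-up names; the case class field has to match the table's partition key column):

```scala
import com.datastax.spark.connector._

// Hypothetical key type matching the partition key column of ks.users.
case class UserKey(user_id: Int)

val keys = sc.parallelize(1 to 1000).map(UserKey)

// Regroup the keys so that each Spark partition only contains keys owned by one Cassandra
// replica, then join: each lookup can be served by the co-located (or at least owning) node.
val joined = keys
  .repartitionByCassandraReplica("ks", "users", 10)  // 10 Spark partitions per Cassandra host
  .joinWithCassandraTable("ks", "users")
```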
This functionality works for both open-source Cassandra clusters and DataStax Enterprise clusters. Cheers!
I have a DSE cluster where every node in the cluster has both Spark and Cassandra running.
When I load data from Cassandra into a Spark RDD and run some action on the RDD, I know the data will be distributed across multiple nodes. In my case, I want to write these RDDs from every node directly to its local Cassandra table; is there any way to do it?
If I do a normal RDD collect, all data from the Spark nodes is merged and goes back to the node with the driver.
I do not want this to happen, as the data flow from the nodes back to the driver node may take a long time; I want the data to be saved to the local node directly, to avoid data movement across the Spark nodes.
When a Spark executor reads data from Cassandra, it sends the request to the "best node", which is selected based on different factors:
When Spark is co-located with Cassandra, Spark tries to pull data from the same node.
When Spark is on a different node, it uses token-aware routing and reads data from multiple nodes in parallel, as defined by the partition ranges.
When it comes to writing, and you have multiple executors, each executor opens multiple connections to each node and writes the data using token-aware routing, meaning that data is sent directly to one of the replicas. Also, Spark tries to batch multiple rows that belong to the same partition into an UNLOGGED BATCH, as it's more performant. Even if the Spark partition is co-located with the Cassandra partition, writing could involve additional network overhead, as SCC by default writes with consistency level LOCAL_QUORUM.
You can get co-located writes if you repartition the data to match Cassandra's partitioning (for example with repartitionByCassandraReplica), but such repartitioning may induce a Spark shuffle that could be much more heavyweight compared to writing data from an executor to another node.
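A minimal, hedged sketch of that write path and the settings mentioned above (host, keyspace, table and the Event case class are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "10.0.0.1")
  .set("spark.cassandra.output.batch.grouping.key", "partition")  // group rows of the same Cassandra partition into unlogged batches
  .set("spark.cassandra.output.consistency.level", "LOCAL_ONE")   // consistency level used for writes from this job
val sc = new SparkContext(conf)

case class Event(id: Int, ts: Long, payload: String)
val events = sc.parallelize(Seq(Event(1, 1L, "a"), Event(2, 2L, "b")))

// Each executor writes its own partitions, routing every batch to a replica of the target partition.
events.saveToCassandra("ks", "events")
```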
P.S. You can find a lot of additional information about the Spark Cassandra Connector in Russell Spitzer's blog.
A word of warning: I only use Cassandra and Spark as separate open-source projects; I do not have expertise with DSE.
I am afraid the data needs to hit the network to replicate, even when every Spark node talks to its local Cassandra node.
Without replication, and with a Spark job that makes sure all data is hashed and pre-shuffled to the corresponding Cassandra node, it should be possible to use 127.0.0.1:9042 and avoid the network.
As Apache Spark is the suggested distributed processing engine for Cassandra, I know that it is possible to run Spark executors along with the Cassandra nodes.
My question is whether the driver and the Spark connector are smart enough to understand partitioning and shard allocation, so that data is processed in a hyper-converged manner.
Simply put, do the executors read data from partitions hosted on the nodes where the executors are running, so that no unnecessary data is transferred across the network, the way Spark does when it runs over HDFS?
Yes, Spark Cassandra Connector is able to do this. From the source code:
The getPreferredLocations method tells Spark the preferred nodes to fetch a partition from, so that the data for the partition are at the same node the task was sent to. If Cassandra nodes are collocated with Spark nodes, the queries are always sent to the Cassandra process running on the same node as the Spark Executor process, hence data are not transferred between nodes. If a Cassandra node fails or gets overloaded during read, the queries are retried to a different node.
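A small sketch that makes this visible (placeholder keyspace/table names; it assumes an existing SparkContext sc with the connector on the classpath):

```scala
import com.datastax.spark.connector._

val rdd = sc.cassandraTable("ks", "tbl")

// For every Spark partition, print the hosts the scheduler would prefer to run the task on.
// With executors co-located on the Cassandra nodes, these are the replicas of the token ranges.
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}
```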
Theoretically yes, and the same goes for HDFS too. However, in practice I have seen less of it in the cloud, where separate nodes are used for Spark and Cassandra when their managed cloud services are used. If you use IaaS and set up your own Cassandra and Spark, then you can achieve it.
I would like to add to Alex's answer:
Yes, Spark Cassandra Connector is able to do this. From the source code:
The getPreferredLocations method tells Spark the preferred nodes to fetch a partition from, so that the data for the partition are at the same node the task was sent to. If Cassandra nodes are collocated with Spark nodes, the queries are always sent to the Cassandra process running on the same node as the Spark Executor process, hence data are not transferred between nodes. If a Cassandra node fails or gets overloaded during read, the queries are retried to a different node.
However, I would argue that this is bad behavior.
In Cassandra, when you ask for the data of a particular partition, only one node is accessed. Spark can actually access 3 nodes thanks to the replication. So without shuffling you have 3 nodes participating in the job.
In Hadoop, however, when you ask for the data of a particular partition, usually all nodes in the cluster are accessed, and then Spark uses all nodes in the cluster as executors.
So in case you have 100 nodes: in Cassandra, Spark will take advantage of 3 nodes; in Hadoop, Spark will take advantage of all 100 nodes.
Cassandra is optimized for real-time operational systems, and is therefore not optimized for analytics workloads such as data lakes.
I am new to Spark and trying to understand how Spark is advantageous when used through the spark-cassandra-connector on a Cassandra cluster.
How does a write to Cassandra (for example saveToCassandra) work through the spark-cassandra-connector (Spark SQL queries)? Does it still involve the coordinator node?
How does a read from Cassandra work through the spark-cassandra-connector (Spark SQL queries)? Does it still involve the coordinator node?
What lets Spark overcome the load on Cassandra during large range read scans on the cluster?
How does a large range-scan CQL read query get executed on the Cassandra cluster through the spark-cassandra-connector?
Is using an IN clause through the spark-cassandra-connector on a Cassandra cluster an advantage?
Here is a good explanation. I also recommend Russell's other talks if you want to understand spark-cassandra-connector internals:
Cassandra and Spark Optimizing for Data Locality - Russell Spitzer (DataStax)
https://www.youtube.com/watch?v=ikCzILOpYvA
I found an article where the author advises using the following Spark-Cassandra architecture (a Spark slave for each Cassandra node):
I have N Cassandra nodes. All nodes are complete replicas of each other. Does it make sense to run a Spark slave for each Cassandra node in my case?
Yes, it does. The Spark-Cassandra connector is data-locality aware, i.e. each Spark node co-located with a Cassandra node will make sure to process only the local Cassandra data, which avoids shuffling lots of data across the network. You can find out how this works by watching a talk by Russell Spitzer on this topic here.