Spark-Cassandra connector data reading - apache-spark

I have a cluster of Cassandra nodes with a Spark worker on each node's machine. For communication I'm using the DataStax Spark-Cassandra connector. Does the DataStax connector optimise reads so that each worker reads data from the Cassandra node on its own machine, or does data flow between machines?

Yes, it does. This is explained in this presentation:
http://www.slideshare.net/SparkSummit/cassandra-and-spark-optimizing-russell-spitzer-1
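For a concrete sense of what that optimisation looks like from the API, here is a minimal read sketch (keyspace and table names are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._ // adds cassandraTable() to SparkContext

    // Minimal sketch; "ks" and "users" are placeholder names.
    val conf = new SparkConf()
      .setAppName("locality-demo")
      .set("spark.cassandra.connection.host", "127.0.0.1") // a local Cassandra node
    val sc = new SparkContext(conf)

    // The connector builds each Spark partition from token ranges owned by one
    // node, so the scheduler can run the task on that node's co-located worker.
    val rdd = sc.cassandraTable("ks", "users")
    println(rdd.count())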
Hope this helps!

Related

Is the Spark-Cassandra-connector node aware?

Is the integration of open-source Apache Cassandra with open-source Spark via the open-source spark-cassandra-connector node-aware, or is this feature reserved for the Enterprise editions only?
By node awareness I mean: will Spark send job execution to the nodes that own the data?
Yes, the Spark connector is node-aware and will function in that manner with both DSE and (open source) Apache Cassandra.
In fact on a SELECT it knows how to hash the partition keys to a token, and send queries on specific token ranges only to the nodes responsible for that data. It can do this, because (like the Cassandra Java driver) it has a window into node-to-node gossip and can see things like node status (up/down) and token range assignment.
In Spark it is referred to as data locality.
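You can see this locality from the Spark shell: CassandraRDD partitions expose their replica hosts through Spark's standard preferredLocations mechanism. A small sketch, assuming an active SparkContext sc, the connector on the classpath, and placeholder keyspace/table names:

    import com.datastax.spark.connector._

    val rdd = sc.cassandraTable("ks", "users")

    // Each partition reports the Cassandra replicas for its token ranges;
    // the scheduler tries to place the task on one of those hosts.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }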
Data locality can only be achieved if the JVMs for Cassandra and the Spark worker/executor are co-located within the same operating system instance (OSI). By definition, data can only be local if the executors doing the processing run on the same server (OSI) as the Cassandra node.
During the initial contact with the cluster, the driver retrieves information about the cluster topology -- available nodes, rack/DC configuration, token ownership. Since the driver is aware of where nodes are located, it will always try to connect to the "closest" node in the same (local) data centre.
If the Spark workers/executors are co-located with the Cassandra node, the Spark-Cassandra-connector will process Spark partitions on the nodes that own the data where possible to reduce the amount of data shuffling across the network.
There are methods such as joinWithCassandraTable() which maximise data locality where possible. Additionally, the repartitionByCassandraReplica() method splits the Spark partitions so they are mapped to Cassandra replicas that own the data.
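A hedged sketch of those two methods together (keyspace, table, and key names below are hypothetical):

    import com.datastax.spark.connector._

    // Hypothetical table: CREATE TABLE ks.users (user_id int PRIMARY KEY, ...).
    case class UserKey(user_id: Int)

    // Keys we want to look up, as an RDD.
    val ids = sc.parallelize(1 to 1000).map(UserKey(_))

    // repartitionByCassandraReplica moves each key to a Spark partition hosted
    // on a replica that owns it; joinWithCassandraTable then performs the
    // lookups as node-local single-partition reads instead of a full scan.
    val users = ids
      .repartitionByCassandraReplica("ks", "users")
      .joinWithCassandraTable("ks", "users")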
This functionality works for both open-source Cassandra clusters and DataStax Enterprise clusters. Cheers!

How do Spark reads/writes through the spark-cassandra-connector differ from CQLSH reads/writes?

I am new to Spark and trying to understand how Spark is advantageous when used through the spark-cassandra-connector on a Cassandra cluster.
How does a write (e.g. saveToCassandra) work through the spark-cassandra-connector (Spark SQL queries)? Does it still involve a coordinator node?
How does a read work through the spark-cassandra-connector (Spark SQL queries)? Does it still involve a coordinator node?
What lets Spark reduce the load on Cassandra during high-range read scans on the cluster?
How does a high-range-scan CQL read query get executed on the Cassandra cluster through the spark-cassandra-connector?
Is using an IN clause through the spark-cassandra-connector on a Cassandra cluster an advantage?
Here is a good explanation. I also recommend Russell's other talks if you want to understand the spark-cassandra-connector internals:
Cassandra and Spark Optimizing for Data Locality - Russell Spitzer (DataStax)
https://www.youtube.com/watch?v=ikCzILOpYvA
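On the write question specifically, a minimal sketch (placeholder keyspace/table): saveToCassandra issues token-aware requests through the underlying Java driver, so rows go straight to a replica for their partition key, and the receiving node coordinates the write as with any client.

    import com.datastax.spark.connector._

    // Placeholder schema: CREATE TABLE ks.kv (key int PRIMARY KEY, value text).
    val data = sc.parallelize(Seq((1, "one"), (2, "two"), (3, "three")))

    // Each executor writes its own partitions directly to Cassandra; the
    // connector groups rows by partition key and routes them token-aware, so
    // the receiving replica coordinates the write as with any client.
    data.saveToCassandra("ks", "kv", SomeColumns("key", "value"))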

Flink and Cassandra deployment similar to Spark?

DataStax bundles Spark directly into its DSE, and most documentation I've seen recommends co-locating Spark with each Cassandra node so that the spark-cassandra-connector works most efficiently with that node's data.
Does Flink's Cassandra connector optimize its data access based on Cassandra partition key hashes as well? If so, does Flink recommend a similarly co-located install of Flink and C* on the same nodes?

Spark with replicated Cassandra nodes

I found an article where the author advises the following Spark-Cassandra architecture: a Spark slave for each Cassandra node.
I have N Cassandra nodes, and all nodes are complete replicas of each other. Does it make sense to run a Spark slave for each Cassandra node in my case?
Yes it does. The Spark-Cassandra connector is data-locality aware, i.e. each Spark node co-located with a Cassandra node will make sure to process only the local Cassandra data, which avoids shuffling lots of data across the network. You can find out how this works in Russell Spitzer's talk on data locality, linked in an earlier answer above.
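A small deployment-side sketch with placeholder addresses: point the connector at a co-located node, and give the scheduler a moment to wait for a node-local slot (spark.locality.wait is a standard Spark setting, not connector-specific):

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder values; with RF == N, every worker is a valid replica for
    // every partition, so every task can run node-local.
    val conf = new SparkConf()
      .setAppName("replicated-cassandra-demo")
      .set("spark.cassandra.connection.host", "127.0.0.1") // the co-located node
      .set("spark.locality.wait", "3s") // wait briefly for a node-local slot
    val sc = new SparkContext(conf)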

Performing Analytics over Cassandra DB

I am working for a small concern and am very new to Apache Cassandra. I am studying Cassandra and performing some small analytics, like a sum function, on the Cassandra DB to create reports. For this, Hive and Acunu could be choices.
DataStax Enterprise provides a solution for Apache Cassandra and Hive integration. Is DataStax Enterprise the only solution for such integration, or is there another way to achieve Hive and Cassandra integration? If so, can I get links or documents about it? And is it possible to do the same on the Windows platform?
Is there any other solution for performing analytics on a Cassandra DB?
Thanks in advance.
I was trying to download DataStax Enterprise (DSE) for Windows but found there is no such option on their website. I suppose they do not support DSE for Windows.
Apache Cassandra does have built-in Hadoop support. You need to set up a standalone Hadoop cluster co-located with the Apache Cassandra nodes and then use ColumnFamilyInputFormat and ColumnFamilyOutputFormat in your Hadoop jobs to read data from and write data to Cassandra.
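A rough sketch of that input setup, using the Thrift-era classes that shipped with older Cassandra releases (they were removed in later versions); all names, addresses, and ports below are placeholders:

    import org.apache.cassandra.hadoop.{ColumnFamilyInputFormat, ConfigHelper}
    import org.apache.cassandra.thrift.{SlicePredicate, SliceRange}
    import org.apache.cassandra.utils.ByteBufferUtil
    import org.apache.hadoop.mapreduce.Job

    val job = Job.getInstance()
    job.setInputFormatClass(classOf[ColumnFamilyInputFormat])

    val conf = job.getConfiguration
    ConfigHelper.setInputInitialAddress(conf, "127.0.0.1")       // any live node
    ConfigHelper.setInputRpcPort(conf, "9160")                   // Thrift port
    ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner")
    ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_cf")

    // Read whole rows: an unbounded slice over all columns.
    val predicate = new SlicePredicate().setSlice_range(
      new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                     ByteBufferUtil.EMPTY_BYTE_BUFFER, false, Int.MaxValue))
    ConfigHelper.setInputSlicePredicate(conf, predicate)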
