Flink and Cassandra deployment similar to Spark? - cassandra

DataStax bundles Spark directly into its DSE, and most documentation I've seen recommends co-locating a Spark worker with each Cassandra node so that the spark-cassandra-connector works most efficiently with that node's data.
Does Flink's Cassandra connector optimize its data access based on Cassandra partition key hashes as well? If so, does Flink recommend a similar co-located install of Flink and C* on the same nodes?

Related

Is the Spark-Cassandra-connector node aware?

Is the integration of DataStax Cassandra Community Edition with the Spark community edition, using the open-source spark-cassandra-connector, node-aware, or is this feature reserved for the Enterprise editions only?
By node awareness I mean whether Spark will send job execution to the nodes owning the data.
Yes, the Spark connector is node-aware and will function in that manner with both DSE and (open source) Apache Cassandra.
In fact on a SELECT it knows how to hash the partition keys to a token, and send queries on specific token ranges only to the nodes responsible for that data. It can do this, because (like the Cassandra Java driver) it has a window into node-to-node gossip and can see things like node status (up/down) and token range assignment.
In Spark it is referred to as data locality.
Data locality can only be achieved if the JVMs for Cassandra and the Spark worker/executor are co-located within the same OS instance (OSI). By definition, data can only be local if the executors doing the processing are running on the same server as the Cassandra node.
During the initial contact with the cluster, the driver retrieves information about the cluster topology -- available nodes, rack/DC configuration, token ownership. Since the driver is aware of where nodes are located, it will always try to connect to the "closest" node in the same (local) data centre.
If the Spark workers/executors are co-located with the Cassandra node, the Spark-Cassandra-connector will process Spark partitions on the nodes that own the data where possible to reduce the amount of data shuffling across the network.
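To see that locality, you can inspect the preferred locations Spark records for each partition of a Cassandra-backed RDD. A minimal sketch, assuming a SparkContext named sc (as in spark-shell) and placeholder keyspace/table names:

import com.datastax.spark.connector._

// Each Spark partition maps to a group of Cassandra token ranges;
// preferredLocations lists the replica hosts the scheduler will try first.
val rdd = sc.cassandraTable("my_keyspace", "my_table")   // placeholder keyspace/table
rdd.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
}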
There are methods such as joinWithCassandraTable() which maximise data locality where possible. Additionally, the repartitionByCassandraReplica() method splits the Spark partitions so they are mapped to Cassandra replicas that own the data.
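A rough sketch of how those two methods are typically combined (the keyspace, table, and key names are placeholders, and sc is an existing SparkContext):

import com.datastax.spark.connector._

// Keys we want to look up; the case class fields must match the table's partition key columns.
case class UserKey(user_id: java.util.UUID)
val keys = sc.parallelize(Seq(UserKey(java.util.UUID.randomUUID())))

// First move each key onto a Spark partition hosted on a replica that owns it,
// then join locally against the Cassandra table instead of doing a full scan.
val joined = keys
  .repartitionByCassandraReplica("my_keyspace", "users", partitionsPerHost = 10)
  .joinWithCassandraTable("my_keyspace", "users")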
This functionality works for both open-source Cassandra clusters and DataStax Enterprise clusters. Cheers!

How do Spark writes/reads through the spark-cassandra-connector differ from the CQLSH read/write process?

I am new to Spark and trying to understand how Spark is advantageous when used through the spark-cassandra-connector on a Cassandra cluster.
How does a write to Cassandra (for example saveToCassandra, or a Spark SQL query) work through the spark-cassandra-connector? Does it still involve a coordinator node?
How does a read from Cassandra work through the spark-cassandra-connector (Spark SQL queries)? Does it still involve a coordinator node?
What lets Spark avoid overloading Cassandra during wide range scans across the cluster?
How does a wide-range CQL read query get executed on the Cassandra cluster through the spark-cassandra-connector?
Is using an IN clause through the spark-cassandra-connector on a Cassandra cluster an advantage?
Here is a good explanation. I also recommend Russell's other talks if you want to understand spark-cassandra-connector internals:
Cassandra and Spark Optimizing for Data Locality - Russell Spitzer (DataStax)
https://www.youtube.com/watch?v=ikCzILOpYvA
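For orientation, a minimal sketch of the two code paths the question asks about (keyspace, table, and columns are placeholders, and sc is an existing SparkContext). Both paths still issue regular CQL through the Java driver bundled with the connector, but reads are split into per-token-range queries handled by the executors:

import com.datastax.spark.connector._

// Write path: each executor writes its own rows, grouped into batches by the connector.
case class User(id: Int, name: String)
val users = sc.parallelize(Seq(User(1, "alice"), User(2, "bob")))
users.saveToCassandra("ks", "users", SomeColumns("id", "name"))   // placeholder keyspace/table

// Read path: the full-table scan is broken into token-range queries, one group per Spark partition.
val rows = sc.cassandraTable("ks", "users").select("id", "name")
println(rows.count())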

Spark-Cassandra connector data reading

I have a cluster of Cassandra nodes with a Spark worker on each node's machine. For communication I'm using the DataStax Spark-Cassandra connector. Does the DataStax connector have an optimisation so a worker reads data from the Cassandra node on the same machine, or is there some dataflow between machines?
Yes. It indeed does.
It is explained in this document.
http://www.slideshare.net/SparkSummit/cassandra-and-spark-optimizing-russell-spitzer-1
Hope this helps!
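For reference, a minimal job setup when the workers are co-located (the contact point, app name, keyspace, and table names are placeholders); as far as I know no extra flag is needed for locality, the connector derives it from the ring metadata:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("local-read-example")                            // placeholder app name
  .set("spark.cassandra.connection.host", "cassandra-host")    // placeholder contact point
val sc = new SparkContext(conf)

// The RDD's partitions carry the owning replica hosts as preferred locations,
// so tasks run on the co-located worker whenever it has free slots.
val rdd = sc.cassandraTable("my_keyspace", "my_table")         // placeholder keyspace/table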

Apache Spark and Zookeeper multi-region deployment?

Has anyone been using Apache Spark in a multi-region deployment?
We are building an application that must be deployed across multiple regions. Our stack is basically Scala, Spark, Cassandra, and Kafka. The main goal is to use Spark Streaming with Kafka and insert the results into Cassandra.
Reading the Spark documentation, ZooKeeper is needed for master high availability, and Kafka needs ZooKeeper as well.
The question is: should I keep a separate Spark cluster in each region, or stretch one cluster across regions the way Cassandra does? Since Spark depends on ZooKeeper for high availability of the master nodes, how should that be handled? Does the same apply to ZooKeeper itself?
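For context, a minimal sketch of the pipeline described above (Kafka direct stream written straight to Cassandra), assuming spark-streaming-kafka-0-10 and the spark-cassandra-connector; the broker address, topic, keyspace, and table names are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

val conf = new SparkConf()
  .setAppName("kafka-to-cassandra")
  .set("spark.cassandra.connection.host", "cassandra-host")    // placeholder contact point
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "kafka-host:9092",                   // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group",
  "auto.offset.reset"  -> "latest"
)

// Consume from Kafka and write each record straight to Cassandra.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

stream
  .map(record => (record.key, record.value))
  .saveToCassandra("ks", "events", SomeColumns("key", "value"))  // placeholder keyspace/table

ssc.start()
ssc.awaitTermination()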

DataStax Enterprise with HDFS and Spark without Cassandra

Is it possible to work with DSE, HDFS, Spark, but without Cassandra?
I'm trying to replace CFS (Cassandra File System) with HDFS (the Hadoop bundled in DSE). For example,
dse hadoop fs -help
needs Cassandra running.
Cassandra takes a lot of memory; I hope that with HDFS only we'd get more free RAM on each node.
Calling DSE Hadoop actually uses the Cassandra File System instead of HDFS, so you cannot run it without Cassandra running. DataStax does support a BYOH (bring your own Hadoop) option, but that involves using a third-party Hadoop distribution. If you don't want Cassandra, though, I would not recommend using the DSE packaging.
