Write a local Spark RDD to the local Cassandra DB - apache-spark

I have a DSE cluster where every node runs both Spark and Cassandra.
When I load data from Cassandra into a Spark RDD and run actions on it, I know the data gets distributed across multiple nodes. In my case, I want every node to write its part of the RDD directly to its local Cassandra table. Is there any way to do that?
If I do a normal rdd.collect, all data from the Spark nodes is merged and sent back to the node running the driver.
I do not want this to happen, because moving the data from the worker nodes back to the driver node may take a long time; I want the data saved to the local node directly, avoiding data movement across the Spark nodes.

When a Spark executor reads data from Cassandra, it sends requests to the "best node", which is selected based on several factors:
When Spark is collocated with Cassandra, Spark tries to pull data from the Cassandra instance on the same node.
When Spark is on a different node, it uses token-aware routing and reads data from multiple nodes in parallel, as defined by the partition ranges.
When it comes to writing, and you have multiple executors, each executor opens connections to each node and writes the data using token-aware routing, meaning the data is sent directly to one of the replicas. Spark also tries to batch multiple rows belonging to the same partition into an UNLOGGED BATCH, as this is more performant. But even if the Spark partition is colocated with the Cassandra partition, writing can involve additional network overhead, because SCC writes with consistency level TWO, so more than one replica has to acknowledge each write.
You can get colocated writes if you repartition the data to match Cassandra's partitioning, but such repartitioning may trigger a Spark shuffle that could be much more heavyweight than simply letting executors write to other nodes.
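For illustration, here is a minimal Scala sketch of that approach with the Spark Cassandra Connector's repartitionByCassandraReplica() and saveToCassandra(); the keyspace, table, contact point and case class are hypothetical placeholders.

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical schema: readings(sensor_id int, ts bigint, value double, PRIMARY KEY (sensor_id, ts))
    case class Reading(sensorId: Int, ts: Long, value: Double)

    val conf = new SparkConf()
      .setAppName("colocated-write-sketch")
      .set("spark.cassandra.connection.host", "10.0.0.1") // any contact point in the cluster
    val sc = new SparkContext(conf)

    val transformed = sc.cassandraTable[Reading]("ks", "readings") // token-aware read
      .map(r => r.copy(value = r.value * 2.0))                     // some per-row transformation

    // Shuffle the Spark partitions so each one lands on a replica that owns its rows,
    // then write; saveToCassandra uses token-aware routing and batches rows per partition key.
    transformed
      .repartitionByCassandraReplica("ks", "readings")
      .saveToCassandra("ks", "readings")

Whether the extra shuffle pays off depends on the data volume; as noted above, it can easily cost more than letting each executor write to a remote replica.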
P.S. You can find a lot of additional information about the Spark Cassandra Connector in Russell Spitzer's blog.

A word of warning: I only use Cassandra and Spark as separate open-source projects; I do not have expertise with DSE.
I am afraid the data needs to hit the network to replicate, even when every Spark node talks to its local Cassandra node.
Without replication, and with a Spark job that makes sure all data is hashed and pre-shuffled to the corresponding Cassandra node, it should be possible to use 127.0.0.1:9042 and avoid the network.
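As a rough sketch of that idea (assuming the Spark Cassandra Connector and a Cassandra process listening on each node's loopback interface; the property names are the connector's standard ones, the values are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("local-contact-point")
      // Initial contact point only; the connector still discovers the whole ring,
      // so this alone does not guarantee node-local reads and writes.
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .config("spark.cassandra.connection.port", "9042")
      .getOrCreate()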

Related

Is the Spark-Cassandra-connector node aware?

Is the DataStax Cassandra community edition integration with the Spark community edition, using the spark-cassandra-connector community edition, node aware, or is this feature reserved for Enterprise editions only?
By node awareness I mean: will Spark send job execution to the nodes that own the data?
Yes, the Spark connector is node-aware and will function in that manner with both DSE and (open source) Apache Cassandra.
In fact, on a SELECT it knows how to hash the partition keys to a token and send queries for specific token ranges only to the nodes responsible for that data. It can do this because (like the Cassandra Java driver) it has a window into node-to-node gossip and can see things like node status (up/down) and token range assignment.
In Spark it is referred to as data locality.
Data locality can only be achieved if both the JVMs for Cassandra and the Spark worker/executor are co-located within the same OS instance (OSI). By definition, data can only be local if the executors doing the processing are running on the same server as the Cassandra node.
During the initial contact with the cluster, the driver retrieves information about the cluster topology -- available nodes, rack/DC configuration, token ownership. Since the driver is aware of where nodes are located, it will always try to connect to the "closest" node in the same (local) data centre.
If the Spark workers/executors are co-located with the Cassandra node, the Spark-Cassandra-connector will process Spark partitions on the nodes that own the data where possible to reduce the amount of data shuffling across the network.
There are methods such as joinWithCassandraTable() which maximise data locality where possible. Additionally, the repartitionByCassandraReplica() method splits the Spark partitions so they are mapped to Cassandra replicas that own the data.
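A short Scala sketch of those two methods, assuming a hypothetical keyspace ks and table users whose partition key is user_id:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    case class UserKey(userId: Int) // maps to the partition key column user_id

    val sc = new SparkContext(new SparkConf().setAppName("locality-sketch"))

    val keys = sc.parallelize(1 to 1000).map(i => UserKey(i))

    // Move each key to a Spark partition on a replica that owns it, then perform
    // single-partition lookups locally instead of scanning the whole table.
    val users = keys
      .repartitionByCassandraReplica("ks", "users")
      .joinWithCassandraTable("ks", "users")

    users.take(10).foreach(println)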
This functionality works for both open-source Cassandra clusters and DataStax Enterprise clusters. Cheers!

Is distributed file storage (HDFS/Cassandra/S3, etc.) mandatory for Spark to run in clustered mode? If yes, why?

Is distributed file storage (HDFS/Cassandra/S3, etc.) mandatory for Spark to run in clustered mode? If yes, why?
Spark is a distributed data processing engine used for computing over huge volumes of data. Let's say I have a huge volume of data stored in MySQL that I want to process. Spark reads the data from MySQL and performs in-memory (or on-disk) computation on the cluster nodes themselves. I am still not able to understand why distributed file storage is needed to run Spark in clustered mode.
Is distributed file storage (HDFS/Cassandra/S3, etc.) mandatory for Spark to run in clustered mode?
Pretty much.
If yes, why?
Because the Spark workers take input from a shared table, distribute the computation amongst themselves, and are then choreographed by the Spark driver to write their data back to another shared table.
If you are trying to work exclusively with MySQL, you might be able to use the local filesystem ("file://") as the cluster FS. However, if any RDD or stage in a Spark query does try to use a shared filesystem as a way of committing work, the output isn't going to propagate from the workers (which will have written to their local filesystems) to the Spark driver (which can only read its own local filesystem).
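To make that concrete, here is a hedged sketch of a MySQL-only pipeline over JDBC, where no shared filesystem holds the input or output; the JDBC URL, credentials and table names are placeholders.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("jdbc-no-shared-fs").getOrCreate()

    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/mydb")
      .option("dbtable", "orders")
      .option("user", "spark")
      .option("password", "secret")
      .load()

    // The aggregation runs on the executors; intermediate shuffle data stays on
    // each worker's local disk, so no distributed filesystem is required here.
    val counts = orders.groupBy("customer_id").count()

    counts.write
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/mydb")
      .option("dbtable", "order_counts")
      .option("user", "spark")
      .option("password", "secret")
      .save()

The caveat in the paragraph above still applies: anything that checkpoints or commits results through a filesystem path needs that path to be visible to both the workers and the driver.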

Cassandra + Spark executor hyperconvergence

As Apache Spark is a suggested distributed processing engine for Cassandra, I know it is possible to run Spark executors along with Cassandra nodes.
My question is whether the driver and Spark connector are smart enough to understand partitioning and shard allocation, so that data is processed in a hyper-converged manner.
Simply put, do the executors read data from partitions hosted on the nodes where the executors are running, so that no unnecessary data is transferred across the network, the way Spark does when it runs over HDFS?
Yes, Spark Cassandra Connector is able to do this. From the source code:
The getPreferredLocations method tells Spark the preferred nodes to fetch a partition from, so that the data for the partition are at the same node the task was sent to. If Cassandra nodes are collocated with Spark nodes, the queries are always sent to the Cassandra process running on the same node as the Spark Executor process, hence data are not transferred between nodes. If a Cassandra node fails or gets overloaded during read, the queries are retried to a different node.
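You can observe this from the Spark side with a small sketch like the following (the keyspace and table names are placeholders); for a colocated setup, each partition's preferred locations should be the hosts running both Cassandra and a Spark executor.

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("locality-check"))

    val rdd = sc.cassandraTable("ks", "my_table")

    // Print the hosts the connector reports as preferred for each Spark partition.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} -> ${rdd.preferredLocations(p).mkString(", ")}")
    }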
Theoretically yes, and the same goes for HDFS. However, in practice I have seen less of it in the cloud, where separate nodes are used for Spark and Cassandra when their managed cloud services are used. If you use IaaS and set up your own Cassandra and Spark, you can achieve it.
I would like to add to Alex's answer:
Yes, Spark Cassandra Connector is able to do this. From the source code:
The getPreferredLocations method tells Spark the preferred nodes to fetch a partition from, so that the data for the partition are at the same node the task was sent to. If Cassandra nodes are collocated with Spark nodes, the queries are always sent to the Cassandra process running on the same node as the Spark Executor process, hence data are not transferred between nodes. If a Cassandra node fails or gets overloaded during read, the queries are retried to a different node.
However, this is actually bad behavior.
In Cassandra, when you ask for the data of a particular partition, only one node is accessed. Spark can actually access 3 nodes thanks to replication. So without shuffling, you have 3 nodes participating in the job.
In Hadoop, however, when you ask for the data of a particular partition, usually all nodes in the cluster are accessed, and then Spark uses all nodes in the cluster as executors.
So in case you have 100 nodes: in Cassandra, Spark will take advantage of 3 nodes; in Hadoop, Spark will take advantage of 100 nodes.
Cassandra is optimized for real-time operational systems, and is therefore not optimized for analytics workloads such as data lakes.

How many connections will be built between Spark and HDFS when sc.textFile("hdfs://.....") is called

How many connections will be built between Spark and HDFS when sc.textFile("hdfs://.....") is called? The file on HDFS is very large (100 GB).
Actually, the main idea behind distributed systems, and of course the one designed and implemented in Hadoop and Spark, is to send the processing to the data. In other words, imagine there is some data located on the HDFS DataNodes of our cluster, and we have a job that uses that data on the same workers. On each machine you would have a DataNode and a Spark worker at the same time, and possibly some other processes such as an HBase RegionServer too. When an executor executes one of its scheduled tasks, it retrieves the data it needs from the underlying DataNode. Then, for each individual task, it retrieves that task's data, so you can describe this as one connection to HDFS on its local DataNode per task.
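As a rough illustration (the path below is a placeholder), the number of partitions gives an upper bound on how many block-read connections the job opens over its lifetime, while only as many as there are concurrently running tasks are open at any instant:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("hdfs-read-sketch"))

    val lines = sc.textFile("hdfs://namenode:8020/data/big-file")

    // For a ~100 GB file with the default 128 MB block size this is roughly 800
    // partitions, i.e. roughly 800 short-lived reads in total, but only
    // (number of executor cores) of them are open at the same time.
    println(lines.getNumPartitions)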

3-node Cassandra cluster with one node as a Spark master - to handle geospatial or geographic data

I am looking for directions:
I have a Cassandra database with latitude & longitude data. I need to search for data within a radius or a coordinate box around a point. I am using a golang (gocql) client to query Cassandra.
I need some understanding regarding Spark and Cassandra, as this seems like the way to go.
Are the following assumptions correct? I have 2 Cassandra nodes (the data has a replication factor of 2).
Should I then install an extra node, install Spark on it, and connect it to the other two existing Cassandra nodes containing the data (with the Spark Connector from DataStax)?
And do the two existing Cassandra nodes need to have Spark workers installed on them to work with the Spark master node?
When the Spark setup is in place, do you query (Scala) the existing data, save the results onto the Spark node, and then query these with the golang (gocql) client?
Any directions is welcome
Thanks in advance
Geospatial searching is a pretty deep topic. If it's just searches you're after (not batch/analytics), I can tell you that you probably don't want to use Spark. Spark isn't very good at 'searching' for data, even when it's geospatial. The main reason is that Spark doesn't index data for efficient searches, and you'd have to create a job/context (unless you're using a job server) every time you want to do a search. That takes forever when you're thinking in terms of user-facing application time.
Solr, Elasticsearch, and DataStax Enterprise Search (disclaimer: I work for DataStax) are all capable of box and radius searches on Cassandra data, and do so in near real time.
To answer your original question though: if the bulk of your analytics in general come from Cassandra data, it may be a good idea to run Spark on the same nodes as Cassandra for data locality. The nice thing is that Spark scales quite nicely, so if you find Spark taking too many resources from Cassandra, you can simply scale out (both Cassandra and Spark).
Should I then install an extra node, install Spark on it, and connect it to the other two existing Cassandra nodes containing the data (with the Spark Connector from DataStax)?
Spark is a cluster compute engine so it needs a cluster of nodes to work well. You'll need to install it on all nodes if you want it to be as efficient as possible.
And do the two existing Cassandra nodes need to have Spark workers installed on them to work with the Spark master node?
I don't think they 'have' to have them, but it's a good idea for locality. There's a really good video on academy.datastax.com that shows how the Spark Cassandra Connector reads data from Cassandra into Spark. I think it will clear a lot of things up for you: https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data
When the Spark setup is in place, do you query (Scala) the existing data, save the results onto the Spark node, and then query these with the golang (gocql) client?
The Spark Cassandra Connector can communicate with both Cassandra and Spark. There are methods, saveToCassandra() for example, that will write data back to Cassandra after your jobs are processed. Then you can use your client as you normally would.
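A minimal Scala sketch of that flow, assuming a hypothetical keyspace geo with a source table points and a result table points_north that the gocql client then queries:

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("geo-batch-sketch"))

    sc.cassandraTable("geo", "points")                      // read the raw lat/long rows
      .filter(row => row.getDouble("lat") > 0.0)            // some batch filtering/enrichment
      .map(row => (row.getString("id"), row.getDouble("lat"), row.getDouble("lon")))
      .saveToCassandra("geo", "points_north", SomeColumns("id", "lat", "lon"))

After the job finishes, the result table is ordinary Cassandra data, so the existing golang (gocql) client can query it directly.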
There are some really good free Spark + Cassandra tutorials at academy.datastax.com. This is also a good place to start: http://rustyrazorblade.com/2015/01/introduction-to-spark-cassandra/
