spark-cassandra-connector how the read work - apache-spark

I have a question about this connector. If my Spark cluster and my Cassandra cluster are not on the same cluster, how does the READ work? Does Spark bring the entire Cassandra table into its own cluster and rearrange it into Spark partition?

push down operations are available between spark and cassandraas long as you filter early, cassandra will conduct all filters so that you ship over network already filtered data, have a read: tips cassandra-spark

Related

What is best approach to join data in spark streaming application?

Question : Essentially it means , rather than running a join of C* table for each streaming records , is there anyway to run a join for each micro-batch ( micro-batching ) of records in spark streaming ?
We are almost finalized to use spark-sql 2.4.x version , datastax-spark-cassandra-connector for Cassandra-3.x version.
But have one fundamental question regarding the efficiency in the below scenario.
For the streaming data records(i.e. streamingDataSet ) , I need to look up for existing records( i.e. cassandraDataset) from Cassandra(C*) table.
i.e.
Dataset<Row> streamingDataSet = //kafka read dataset
Dataset<Row> cassandraDataset= //loaded from C* table those records loaded earlier from above.
To look up data i need to join above datasets
i.e.
Dataset<Row> joinDataSet = cassandraDataset.join(cassandraDataset).where(//somelogic)
process further the joinDataSet to implement the business logic ...
In the above scenario, my understanding is ,for each record received
from kafka stream it would query the C* table i.e. data base call.
Does not it take huge time and network bandwidth if C* table consists
billions of records? What should be the approach/procedure to be
followed to improve look up C* table ?
What is the best solution in this scenario ? I CAN NOT load once from
C* table and look up as the data keep on adding to C* table ... i.e.
new look ups might need newly persisted data.
How to handle this kind of scenario? any advices plzz..
If you're using Apache Cassandra, then you have only one possibility for effective join with data in Cassandra - via RDD API's joinWithCassandraTable. The open source version of the Spark Cassandra Connector (SCC) supports only it, while in version for DSE, there is a code that allows to perform effective join against Cassandra also for Spark SQL - so-called DSE Direct Join. If you'll use join in Spark SQL against Cassandra table, Spark will need to read all data from Cassandra, and then perform join - that's very slow.
I don't have an example for OSS SCC for doing the join for Spark Structured Streaming, but I have some examples for "normal" join, like this:
CassandraJavaPairRDD<Tuple1<Integer>, Tuple2<Integer, String>> joinedRDD =
trdd.joinWithCassandraTable("test", "jtest",
someColumns("id", "v"), someColumns("id"),
mapRowToTuple(Integer.class, String.class), mapTupleToRow(Integer.class));

What is the right approach to upload the data into HBase from Apache Spark?

I am working on writing a Spark job which reads the data from the Hive and store in HBase for real time access. The executor makes the connection with HBase, what is the right approach to insert the data into. I have thought of following two approaches.
Which one is more appropriate or is there any other approach?
Write data directly from Spark Job to Hbase
Write data from Spark to HDFS and later move it to Hbase

How does spark copy data between cassandra tables?

Can anyone please explain the internal working of spark when reading data from one table and writing it to another in cassandra.
Here is my use case:
I am ingesting data coming in from an IOT platform into cassandra through a kafka topic. I have a small python script that parses each message from kafka to get the tablename it belongs to, prepares a query and writes it to cassandra using datastax's cassandra-driver for python. With that script I am able to ingest around 300000 records per min into cassandra. However my incoming data rate is 510000 records per minute so kafka consumer lag keeps on increasing.
Python script is already making concurrent calls to cassandra. If I increase the number of python executors, cassandra-driver starts failing because cassandra nodes become unavailable to it. I am assumin there is a limit of cassandra calls per sec that I am hitting there. Here is the error message that I get:
ERROR Operation failed: ('Unable to complete the operation against any hosts', {<Host: 10.128.1.3 datacenter1>: ConnectionException('Pool is shutdown',), <Host: 10.128.1.1 datacenter1>: ConnectionException('Pool is shutdown',)})"
Recently, I ran a pyspark job to copy data from a couple of columns in one table to another. The table had around 168 million records in it. Pyspark job completed in around 5 hours. So it processed over 550000 records per min.
Here is the pyspark code I am using:
df = spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table=sourcetable, keyspace=sourcekeyspace)\
.load().cache()
df.createOrReplaceTempView("data")
query = ("select dev_id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, " + field + " as value from data " )
vgDF = spark.sql(query)
vgDF.show(50)
vgDF.write\
.format("org.apache.spark.sql.cassandra")\
.mode('append')\
.options(table=newtable, keyspace=newkeyspace)\
.save()
Versions:
Cassandra 3.9.
Spark 2.1.0.
Datastax's spark-cassandra-connector 2.0.1
Scala version 2.11
Cluster:
Spark setup with 3 workers and 1 master node.
3 worker nodes also have a cassandra cluster installed. (each cassandra node with one spark worker node)
Each worker was allowed 10 GB ram and 3 cores.
So I am wondering:
Does spark read all the data from cassandra first and then writes it to the new table or is there some kind of optimization in spark cassandra connector that allows it to move the data around cassandra tables without reading all the records?
If I replace my python script with a spark streaming job in which I parse the packet to get the table name for cassandra, will that help me ingest data more quickly into cassandra?
Spark connector is optimized because it parallelize processing and reading/inserting data into nodes that are owns the data. You may get better throughput by using Cassandra Spark Connector, but this will require more resources.
Talking about your task - 300000 inserts/minute is 5000/second, and this is not very big number frankly speaking - you can increase throughput by putting different optimizations:
Using asynchronous calls to submit requests. You only need to make sure that you submit more requests that could be handled by one connection (but you can also increase this number - I'm not sure how to do it in Python, but please check Java driver doc to get an idea).
use correct consistency level (LOCAL_ONE should give you very good performance)
use correct load balancing policy
you can run several copies of your script in parallel, making sure that they are all in the same Kafka consumer group.

Live sync of Spark RDD from Cassandra table

I'm looking for a way to keep a Spark RDD in sync with a Cassandra table. I know it is possible to load a full Cassandra table into an RDD as a one shot operation but would like to keep the RDD synchronized with updates happening to the Cassandra table.
This will allow to not reload the full table into Spark everytime I need to get fresh data into Spark (which can be long if the table is big).
Any hint ?

Retrieve Cassandra partition data in Apache Spark

I have my data well organized by partition key on Cassandra. I would like to retrieve this data in Spark and keep the same partitions.
My goal is to avoid a very large shuffle.
PS : I am using Cassandra 2.1 and Spark 1.5
The Spark Cassandra Connector reads C* Token Ranges into Spark Partitions. This means all of the values for any given Cassandra Partition key will be in the same Spark Partition.
https://academy.datastax.com/demos/how-spark-cassandra-connector-reads-data

Resources