How does spark copy data between cassandra tables? - apache-spark

Can anyone please explain how Spark works internally when it reads data from one Cassandra table and writes it to another?
Here is my use case:
I am ingesting data coming in from an IoT platform into Cassandra through a Kafka topic. I have a small Python script that parses each message from Kafka to get the table name it belongs to, prepares a query, and writes it to Cassandra using DataStax's cassandra-driver for Python. With that script I am able to ingest around 300000 records per minute into Cassandra. However, my incoming data rate is 510000 records per minute, so the Kafka consumer lag keeps increasing.
The Python script is already making concurrent calls to Cassandra. If I increase the number of Python executors, cassandra-driver starts failing because the Cassandra nodes become unavailable to it. I am assuming there is a limit on Cassandra calls per second that I am hitting. Here is the error message that I get:
ERROR Operation failed: ('Unable to complete the operation against any hosts', {<Host: 10.128.1.3 datacenter1>: ConnectionException('Pool is shutdown',), <Host: 10.128.1.1 datacenter1>: ConnectionException('Pool is shutdown',)})"
Recently, I ran a PySpark job to copy data from a couple of columns in one table to another. The table had around 168 million records in it. The PySpark job completed in around 5 hours, so it processed over 550000 records per minute.
Here is the PySpark code I am using:
# Read the source table into a DataFrame and cache it.
df = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table=sourcetable, keyspace=sourcekeyspace)\
    .load().cache()
df.createOrReplaceTempView("data")

# Project the columns needed for the target table; `field` holds the name of the value column.
query = ("select dev_id, datetime, DATE_FORMAT(datetime, 'yyyy-MM-dd') as day, " + field + " as value from data")
vgDF = spark.sql(query)
vgDF.show(50)

# Append the result to the target table.
vgDF.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table=newtable, keyspace=newkeyspace)\
    .save()
Versions:
Cassandra 3.9
Spark 2.1.0
DataStax spark-cassandra-connector 2.0.1
Scala 2.11
Cluster:
Spark setup with 3 workers and 1 master node.
The 3 worker nodes also host a Cassandra cluster (one Cassandra node co-located with each Spark worker node).
Each worker was allowed 10 GB of RAM and 3 cores.
So I am wondering:
Does Spark read all the data from Cassandra first and then write it to the new table, or is there some kind of optimization in the Spark Cassandra Connector that lets it move data between Cassandra tables without reading all the records?
If I replace my Python script with a Spark Streaming job in which I parse the packet to get the Cassandra table name, will that help me ingest data into Cassandra more quickly?

The Spark connector is optimized in the sense that it parallelizes processing and reads/inserts data on the nodes that own that data. You may get better throughput by using the Cassandra Spark Connector, but this will require more resources.
Regarding your task: 300000 inserts per minute is 5000 per second, which frankly speaking is not a very big number. You can increase throughput with several optimizations:
Use asynchronous calls to submit requests. Just make sure that you don't submit more requests than a single connection can handle (you can also increase that limit - I'm not sure how to do it in Python, but check the Java driver docs to get an idea). See the sketch after this list.
Use the correct consistency level (LOCAL_ONE should give you very good performance).
Use the correct load balancing policy.
Run several copies of your script in parallel, making sure that they are all in the same Kafka consumer group.
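A minimal sketch of the asynchronous approach with the DataStax Python driver, assuming a placeholder keyspace, table, and column set (adjust to your schema); execute_concurrent_with_args keeps only a bounded number of requests in flight, so no single connection is flooded:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

# Placeholder contact points, keyspace, and table -- adjust to your cluster and schema.
cluster = Cluster(["10.128.1.1", "10.128.1.3"])
session = cluster.connect("my_keyspace")

insert = session.prepare(
    "INSERT INTO sensor_data (dev_id, datetime, value) VALUES (?, ?, ?)"
)
insert.consistency_level = ConsistencyLevel.LOCAL_ONE

def write_rows(rows):
    # rows: list of (dev_id, datetime, value) tuples parsed from Kafka messages.
    # At most `concurrency` requests are in flight at any time.
    results = execute_concurrent_with_args(
        session, insert, rows, concurrency=100, raise_on_first_error=False
    )
    for ok, result_or_exc in results:
        if not ok:
            print("insert failed:", result_or_exc)

The concurrency value is only a starting point; combined with several script copies in the same Kafka consumer group, it lets you scale the ingest rate without overwhelming any single connection.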

Related

How persist(StorageLevel.MEMORY_AND_DISK) works in Spark 3.1 with Java implementation

I am using Apache Spark 3.1 with Java on a GCP Dataproc cluster, and my code structure is like this:
Dataset<Row> dataset1 = readSpannerData(session, session.sessionState().newHadoopConf());
Dataset<Row> dataset2 = ...; // reads some data from table1 in Bigtable
Dataset<Row> result1 = dataset1.join(dataset2);
dataset1.persist(StorageLevel.MEMORY_AND_DISK());
dataset2.persist(StorageLevel.MEMORY_AND_DISK()); // once the usage is done I am persisting both datasets
System.out.println(result1.count()); // it throws the error on this line
The exact error in the YARN UI points at the select query on the Spanner table that I use at the start of the job, not at any Bigtable read. I persisted dataset1 only after its usage was done.
My cluster has autoscaling enabled with a maximum of 250 worker nodes, each with 8 cores and 1024 GB of memory.
It is configured to use 2 executors on each node (4 cores per executor).
It was working fine with a low volume of data, but it throws the error when running with a huge volume.
Why does it throw the error in this situation? Will it look back into the persisted parent dataset while using the result that was calculated from it? If we have to keep that parent dataset around anyway, what is the point of in-memory storage?
How does it work in low-data environments? On how many nodes, and for how long, is an in-memory dataset maintained in a Spark job? Does the volume of the data affect the in-memory dataset?
Can anyone clarify this?
Thanks in advance :)
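For what it's worth, the general persist lifecycle works like this: persist() is lazy and only marks the dataset for caching, the first action materializes both the caches and the result, and the data stays cached until unpersist() is called or executors evict it (MEMORY_AND_DISK spills evicted partitions to local disk rather than recomputing them). A PySpark sketch of that pattern, with placeholder inputs and a placeholder join key standing in for the Spanner and Bigtable reads:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-sketch").getOrCreate()

# Placeholder inputs standing in for the Spanner and Bigtable reads.
dataset1 = spark.read.parquet("/data/spanner_export")
dataset2 = spark.read.parquet("/data/bigtable_export")

# persist() is lazy: it only marks the plans for caching.
dataset1.persist()   # the DataFrame default storage level is MEMORY_AND_DISK
dataset2.persist()

result1 = dataset1.join(dataset2, "id")   # "id" is a placeholder join key

# The first action materializes the join and fills both caches; partitions
# that do not fit in executor memory are spilled to local disk.
print(result1.count())

# Release the cached data once it is no longer needed.
dataset1.unpersist()
dataset2.unpersist()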

Read from HBase + Convert to DF + Run SQLs

Edit
My use case is a Spark Streaming app (Spark 2.1.1 + Kafka 0.10.2.1) in which I read from Kafka and, for each message/trigger, need to pull data from HBase. After the pull, I need to run some SQL statements on the data received from HBase.
Naturally, I intend to push the processing (the HBase read and the SQL execution) to the worker nodes to achieve parallelism.
So far, my attempts to convert the data from HBase to a DataFrame (so that I can run SQL statements) are failing. Another gent mentioned that it's not "allowed" since that part runs on the executors. However, this is my conscious choice to run those pieces on the worker nodes.
Is that sound thinking? If not, why not?
What's the recommendation on that, or on the overall idea?
Reading from HBase and running SQL for every streamed record seems like too much happening in a streaming app.
Anyway, you can create a connection to HBase for every partition, fetch the records, and then compare; see the sketch below. I am not sure about the SQL part - if it is just another read for every streaming record, again handle it at the partition level in Spark.
The above approach will still be time consuming - just make sure you finish all the work before the next batch starts.
You also mentioned converting "HBase to DataFrame" and doing it "in parallel". These seem to point in opposite directions, because you start with a DataFrame (maybe reading from HBase once) and only then parallelize. I hope I cleared up some of your doubts.
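A rough PySpark sketch of that per-partition pattern. make_hbase_client and lookup_row are hypothetical placeholders for whatever HBase client you use (for example happybase); the point is that the connection is opened once per partition on the executor, not once per record:

# Hypothetical helpers -- replace with your actual HBase client calls.
def make_hbase_client():
    raise NotImplementedError

def lookup_row(client, rowkey):
    raise NotImplementedError

def enrich_partition(records):
    # One HBase connection per partition, reused for every record in it.
    client = make_hbase_client()
    try:
        for rec in records:
            yield dict(rec, **lookup_row(client, rec["rowkey"]))
    finally:
        client.close()

def process_batch(spark, rdd):
    # Called from foreachRDD; mapPartitions runs enrich_partition on the workers.
    if rdd.isEmpty():
        return
    enriched = rdd.mapPartitions(enrich_partition)
    df = spark.createDataFrame(enriched)
    df.createOrReplaceTempView("enriched")
    spark.sql("SELECT rowkey, COUNT(*) AS n FROM enriched GROUP BY rowkey").show()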

Spark Cassandra Performance Issue

I am a new learner of Spark and Cassandra and I am facing a major performance issue. I am streaming data from Kafka into Spark every 5 seconds, then performing analytics on the data in R using JRI, and finally saving the data to the respective Cassandra column family. The time taken (in milliseconds) to save the data to Cassandra increases very rapidly with the number of input requests [each request is 200 KB].
Spark code:
sessionData.foreachRDD(new Function<JavaRDD<NormalizedData>, Void>() {
    public Void call(JavaRDD<NormalizedData> rdd) {
        System.out.println("step-3 " + System.currentTimeMillis());
        // write the batch to the normalized_data table in keyspace "keyspace"
        javaFunctions(rdd).writerBuilder("keyspace", "normalized_data", mapToRow(NormalizedData.class)).saveToCassandra();
        System.out.println("step-4 " + System.currentTimeMillis());
        return null;
    }
});
I was able to improve performance by running Spark and Cassandra on the same server. The delay occurred because Spark and Cassandra were on different servers (although in the same AWS region); the network latency was the main cause, as it hurt data locality. Thanks.
You can refer to this blog for Spark-Cassandra connector tuning; it will give you an idea of the performance numbers you can expect. You can also try another open source product, SnappyData (the Spark database), which will give you very high performance in your use case.
I am also using the Cassandra + Spark combination for real-time analytics. The following are a few best practices:
Data locality - run the Cassandra daemon alongside the worker process (the Worker node for Spark standalone, the Node Manager for YARN, or the Mesos worker for Mesos).
Increase the parallelism, i.e. create more partitions/tasks (see the sketch after this list).
Use Cassandra connection pooling to improve throughput.
In your case you are using JRI to call R from Java, which is a bit slow and adds performance overhead, so use SparkR to integrate R with Spark instead of calling JRI directly.
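To illustrate the parallelism point, a hedged PySpark sketch (the keyspace and table names are placeholders): repartitioning before the write simply gives the connector more tasks to execute concurrently across the executors.

# Placeholder keyspace/table names -- adjust to your schema.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="normalized_data", keyspace="keyspace")
      .load())

# More partitions -> more concurrent write tasks across the executors.
(df.repartition(48)
   .write
   .format("org.apache.spark.sql.cassandra")
   .mode("append")
   .options(table="normalized_data_by_day", keyspace="keyspace")
   .save())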

All host(s) tried for query failed - com.datastax.driver.core.OperationTimedOutException

While performing Cassandra operations (batch execution of insert and update operations on two tables) from a Spark job, I am getting the error "All host(s) tried for query failed - com.datastax.driver.core.OperationTimedOutException".
Cluster information:
Cassandra 2.1.8.621 | DSE 4.7.1
spark-cassandra-connector-java_2.10 version - 1.2.0-rc1 | cassandra-driver-core version - 2.1.7
Spark 1.2.1 | Hadoop 2.7.1 => 3 nodes
Cassandra 2.1.8 => 5 nodes
Each node has 28 GB of memory and 24 cores.
While searching for a solution I came across some discussions which say you should not use BATCHES. However, I would like to find the root cause of this error. Also, how and where do I set/get SocketOptions.setReadTimeout? As per the standard guideline, this timeout limit must be greater than the Cassandra request timeout to avoid possible errors.
Are request_timeout_in_ms and SocketOptions.setReadTimeout the same? Can anyone help me with this?
While performing Cassandra operations (batch execution - insert and update operations on two tables) using a Spark job I am getting the "All host(s) tried for query failed - com.datastax.driver.core.OperationTimedOutException" error.
Directly from the docs:
Why are my write tasks timing out/ failing?
The most common cause of this is that Spark is able to issue write requests much more quickly than Cassandra can handle them. This can lead to GC issues and build up of hints. If this is the case with your application, try lowering the number of concurrent writes and the current batch size using the following options.
spark.cassandra.output.batch.size.rows
spark.cassandra.output.concurrent.writes
or in versions of the Spark Cassandra Connector greater than or equal to 1.2.0 set
spark.cassandra.output.throughput_mb_per_sec
which will allow you to control the amount of data written to C* per Spark core per second.
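For example, in a PySpark job these settings could be passed on the SparkConf (the values below are only illustrative starting points, not recommendations; for a Java job the same keys can be passed with --conf on spark-submit):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("cassandra-writer")
        .set("spark.cassandra.output.batch.size.rows", "500")       # smaller batches
        .set("spark.cassandra.output.concurrent.writes", "2")       # fewer writes in flight
        .set("spark.cassandra.output.throughput_mb_per_sec", "5"))  # connector >= 1.2.0 only

sc = SparkContext(conf=conf)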
you should not use BATCHES
This is not always true: the connector uses local token-aware batches for faster reads and writes, but this is tricky to get right in a custom app. In many cases async queries are better or just as good.
setReadTimeout
This is a DataStax Java driver method. The connector takes care of it for you; there is no need to change it.

spark datastax cassandra connector slow to read from heavy cassandra table

I am new to Spark and the Spark Cassandra Connector. Our team is trying Spark for the first time, and we are using the Spark Cassandra Connector to connect to our Cassandra database.
I wrote a query that uses a heavy table of the database, and I saw that the Spark tasks did not start until the query had fetched all the records from the table.
It is taking more than 3 hours just to fetch all the records from the database.
To get the data from the DB we use:
CassandraJavaUtil.javaFunctions(sparkContextManager.getJavaSparkContext(SOURCE).sc())
    .cassandraTable(keyspaceName, tableName);
Is there a way to tell Spark to start working even before all the data has finished downloading?
Is there an option to tell spark-cassandra-connector to use more threads for the fetch?
thanks,
kokou.
If you look at the Spark UI, how many partitions is your table scan creating? I just did something like this and found that Spark was creating too many partitions for the scan, and it was taking much longer as a result. The way I decreased the time on my job was by setting the configuration parameter spark.cassandra.input.split.size_in_mb to a value higher than the default. In my case it took a 20-minute job down to about four minutes. There are also a couple more Cassandra-read-specific Spark variables that you can set, found here.
These stackoverflow questions are what I referenced originally, I hope they help you out as well.
Iterate large Cassandra table in small chunks
Set number of tasks on Cassandra table scan
EDIT:
After doing some performance testing with regard to fiddling with some Spark configuration parameters, I found that Spark was creating far too many table partitions when I wasn't giving the Spark executors enough memory. In my case, upping the memory by a gigabyte was enough to render the input split size parameter unnecessary. If you can't give the executors more memory, you may still need to set spark.cassandra.input.split.size_in_mb higher as a workaround.
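A hedged illustration of both knobs in PySpark (the values are examples only; for a Java job the same keys can be passed with --conf on spark-submit):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("cassandra-scan")
        .set("spark.executor.memory", "5g")                      # more executor headroom
        .set("spark.cassandra.input.split.size_in_mb", "256"))   # larger splits -> fewer scan partitions

sc = SparkContext(conf=conf)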
