I am new to Spark and Cassandra and am facing a major performance issue. I stream data from Kafka into Spark every 5 seconds, run analytics on the data in R using JRI, and finally save the data to the respective Cassandra column family. The time (in milliseconds) taken to save the data to Cassandra increases very rapidly with the number of input requests (each request is 200 KB).
Spark code:
sessionData.foreachRDD(new Function<JavaRDD<NormalizedData>, Void>() {
    public Void call(JavaRDD<NormalizedData> rdd) {
        System.out.println("step-3 " + System.currentTimeMillis());
        javaFunctions(rdd).writerBuilder("keyspace", "normalized_data", mapToRow(NormalizedData.class)).saveToCassandra();
        System.out.println("step-4 " + System.currentTimeMillis());
        return null; // a Function returning Void must return null
    }
});
I was able to improve performance by running Spark and Cassandra on the same server. The delay arose because Spark and Cassandra were on different servers, though in the same AWS region. Network delay was the main cause, as it hurt data locality. Thanks.
You can refer to this blog for Spark-Cassandra connector tuning. It will give you an idea of the performance numbers you can expect. You can also try another open source product, SnappyData, which is the Spark database and may give you very high performance in your use case.
I am also using the Cassandra and Spark combination to do realtime analytics. Here are a few best practices:
Data locality - Run the Cassandra daemon alongside the Worker node in Spark standalone, the Node Manager in YARN, or the Mesos worker in Mesos
Increase the parallelism, i.e., create more partitions/tasks
Use Cassandra connection pooling to improve throughput
In your case, you are using JRI to call R inside Java. This is slow and adds performance overhead, so use SparkR to integrate R with Spark instead of calling JRI directly.
I am using Apache Spark 3.1 with Java on a GCP Dataproc cluster. My code is structured like this:
Dataset<Row> dataset1 = readSpannerData(session, session.sessionState().newHadoopConf());
Dataset<Row> dataset2 = /* read some data from Bigtable table1 */;
Dataset<Row> result1 = dataset1.join(dataset2);
dataset1.persist(StorageLevel.MEMORY_AND_DISK());
dataset2.persist(StorageLevel.MEMORY_AND_DISK()); //once the usage is done I am persisting both datasets
System.out.println(result1.count()); // It throws error in this line
The error shown in the YARN UI is on the select query against the Spanner table, which I run at the start of the job, not on anything from Bigtable. I persisted dataset1 only after its usage was done.
My cluster has autoscaling enabled with a maximum of 250 worker nodes, each with 8 cores and 1024GB memory.
It is configured to use 2 executors on each node (4 cores per executor).
It works fine with a low volume of data, but it throws an error when running with huge data.
Why does it throw an error in this situation? Does Spark look into the parent in-memory dataset when using a result calculated from a parent dataset that is already persisted? If we have to maintain that dataset ourselves, then what is the use of in-memory storage?
How does it work in low-data environments? On how many nodes and for how long is an in-memory dataset maintained in a Spark job? Does the volume of data affect the in-memory dataset?
Can anyone clarify this doubt?
Thanks in advance :)
Can anyone please explain the internal working of Spark when reading data from one table and writing it to another in Cassandra?
Here is my use case:
I am ingesting data coming in from an IoT platform into Cassandra through a Kafka topic. I have a small Python script that parses each message from Kafka to get the table name it belongs to, prepares a query and writes it to Cassandra using DataStax's cassandra-driver for Python. With that script I am able to ingest around 300000 records per minute into Cassandra. However, my incoming data rate is 510000 records per minute, so Kafka consumer lag keeps increasing.
The Python script is already making concurrent calls to Cassandra. If I increase the number of Python executors, cassandra-driver starts failing because the Cassandra nodes become unavailable to it. I am assuming there is a limit on Cassandra calls per second that I am hitting. Here is the error message that I get:
ERROR Operation failed: ('Unable to complete the operation against any hosts', {<Host: 10.128.1.3 datacenter1>: ConnectionException('Pool is shutdown',), <Host: 10.128.1.1 datacenter1>: ConnectionException('Pool is shutdown',)})"
Recently, I ran a pyspark job to copy data from a couple of columns in one table to another. The table had around 168 million records in it. Pyspark job completed in around 5 hours. So it processed over 550000 records per min.
Here is the pyspark code I am using:
df = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table=sourcetable, keyspace=sourcekeyspace)\
    .load().cache()
df.createOrReplaceTempView("data")
query = ("select dev_id,datetime,DATE_FORMAT(datetime,'yyyy-MM-dd') as day, " + field + " as value from data " )
vgDF = spark.sql(query)
vgDF.show(50)
vgDF.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table=newtable, keyspace=newkeyspace)\
    .save()
Versions:
Cassandra 3.9.
Spark 2.1.0.
Datastax's spark-cassandra-connector 2.0.1
Scala version 2.11
Cluster:
Spark setup with 3 workers and 1 master node.
3 worker nodes also have a cassandra cluster installed. (each cassandra node with one spark worker node)
Each worker was allowed 10 GB ram and 3 cores.
So I am wondering:
Does Spark read all the data from Cassandra first and then write it to the new table, or is there some kind of optimization in the Spark Cassandra connector that allows it to move data between Cassandra tables without reading all the records?
If I replace my Python script with a Spark streaming job in which I parse each packet to get the Cassandra table name, will that help me ingest data into Cassandra more quickly?
The Spark connector is optimized because it parallelizes processing and reads/inserts data on the nodes that own the data. You may get better throughput by using the Cassandra Spark Connector, but this will require more resources.
As for your task: 300000 inserts/minute is 5000/second, which frankly is not a very big number. You can increase throughput by applying several optimizations:
Use asynchronous calls to submit requests. You only need to make sure that you don't submit more requests than can be handled by one connection (you can also increase this number; I'm not sure how to do it in Python, but please check the Java driver docs to get an idea).
Use the correct consistency level (LOCAL_ONE should give you very good performance).
Use the correct load balancing policy.
Run several copies of your script in parallel, making sure that they are all in the same Kafka consumer group.
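The first point above, asynchronous submission with a cap on pending requests, can be sketched roughly as follows. This is only an illustration using the standard library: fake_async_write is a hypothetical stand-in for an asynchronous driver call (with the DataStax Python driver the real call would be something like session.execute_async), and MAX_IN_FLIGHT is an assumed value that you would tune to what one connection can handle.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_IN_FLIGHT = 8  # assumed cap on pending writes; tune for your connection


def fake_async_write(executor, record, sink):
    """Hypothetical stand-in for an asynchronous driver call; returns a future."""
    return executor.submit(sink.append, record)


def write_all(records, max_in_flight=MAX_IN_FLIGHT):
    """Submit every record asynchronously with at most max_in_flight pending writes."""
    sink = []  # stands in for the target table
    slots = threading.Semaphore(max_in_flight)
    with ThreadPoolExecutor(max_workers=max_in_flight) as executor:
        for record in records:
            slots.acquire()  # blocks while max_in_flight writes are still pending
            future = fake_async_write(executor, record, sink)
            # free the slot as soon as this write completes
            future.add_done_callback(lambda _f: slots.release())
    # leaving the with-block waits for all outstanding writes
    return sink
```

The semaphore is what keeps the producer from flooding the connection: the loop only submits a new write once an earlier one has finished.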
My use case is a Spark streaming app (Spark 2.1.1 + Kafka 0.10.2.1) in which I read from Kafka and, for each message/trigger, need to pull data from HBase. After the pull, I need to run some SQL statements on the data received from HBase.
Naturally, I intend to push the processing (read from HBase & SQL execution) to the worker nodes to achieve parallelism.
So far, my attempts to convert the data from HBase to a data frame (so that I can launch SQL statements) are failing. Another person mentioned that it's not "allowed" since that part is running on executors. However, running those pieces on worker nodes is my conscious choice.
Is that sound thinking? If not, why not?
What's the recommendation on that? or on the overall idea?
Reading from HBase and running SQL for every streamed record seems like "too much happening in a streaming app".
Anyway, you can create one connection to HBase per partition, fetch the records, and then compare. I'm not sure about the SQL part. If it's just another read for every streaming record, again handle it at the partition level in Spark.
But the above approach will be time consuming, so make sure each batch finishes all its work before the next batch starts.
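To illustrate the connection-per-partition idea, here is a minimal sketch in plain Python. FakeConnection is a hypothetical stand-in for an HBase client, and the partitions are simulated as lists. In a real job, a function like lookup_partition would be passed to something like rdd.mapPartitions, so treat the names here as assumptions rather than an actual HBase API.

```python
class FakeConnection:
    """Hypothetical stand-in for an HBase client connection."""

    opened = 0  # counts how many connections were created

    def __init__(self, table):
        FakeConnection.opened += 1
        self.table = table
        self.closed = False

    def get(self, key):
        # returns None when the key is absent
        return self.table.get(key)

    def close(self):
        self.closed = True


def lookup_partition(records, table):
    """Open one connection for the whole partition, then look up every record."""
    conn = FakeConnection(table)
    try:
        for key in records:
            yield key, conn.get(key)
    finally:
        conn.close()  # always release the connection, even on errors
```

The point of the pattern is that the connection cost is paid once per partition instead of once per record.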
You also mentioned converting "HBase to dataframe" and "parallel". These seem to pull in opposite directions, because you start with a dataframe (maybe reading from HBase once) and then you parallelize. I hope this clears up some of your doubts.
What is a better choice for a long-term store (many writes, few reads) of data processed through Spark Streaming: Parquet, HBase or Cassandra? Or something else? What are the trade-offs?
In my experience, we have used HBase as the datastore for Spark streaming data (we also had the same scenario of many writes and few reads). Since we are using Hadoop, and HBase has native integration with Hadoop, it went well.
We used it to store a high rate of messages coming in from Solace.
HBase is well suited for range-based scans. Cassandra is known for availability and many other things.
However, I also observe a general trend in many projects: they simply store raw data in HDFS (Parquet + Avro) in a partitioned structure through Spark streaming with Spark dataframes (SaveMode.Append), and they process the raw data with Spark.
Example of a partitioned structure in HDFS:
completion ofbusinessdate/environment/businesssubtype/message type etc....
In this case there is no need to go to HBase or any other data store.
But one common issue with the above approach is that streaming produces many small files, so you would need repartition(1), coalesce, or FileUtil.copyMerge to combine them into a single file per partition that meets block size requirements. Apart from that, the above approach would be fine.
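As a rough illustration of the small-files point, the sketch below estimates how many output files a batch should be written as so that each file is about one block in size. The 128 MB block size is an assumption (check dfs.blocksize on your cluster), and target_file_count is a hypothetical helper, not part of Spark or Hadoop.

```python
import math

BLOCK_SIZE_BYTES = 128 * 1024 * 1024  # assumed HDFS block size


def target_file_count(batch_bytes, block_size=BLOCK_SIZE_BYTES):
    """Return how many files keep each output file at or below one block."""
    # always write at least one file, even for a tiny or empty batch
    return max(1, math.ceil(batch_bytes / block_size))
```

The result could then be used as the argument to coalesce before the write, instead of always collapsing to a single partition.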
There is something called the CAP theorem, on which the decision can be based.
Consistency (all nodes see the same data at the same time).
Availability (every request receives a response about whether it succeeded or failed).
Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures).
Cassandra supports AP.
HBase supports CP.
Look at detailed analysis given here
I am new to Spark and the Spark Cassandra Connector. We are trying Spark for the first time in our team, and we are using the Spark Cassandra Connector to connect to a Cassandra database.
I wrote a query that uses a heavy table in the database, and I saw that the Spark task didn't start until the query had fetched all the records from the table.
It is taking more than 3 hours just to fetch all the records from the database.
To get the data from the DB we use:
CassandraJavaUtil.javaFunctions(sparkContextManager.getJavaSparkContext(SOURCE).sc())
.cassandraTable(keyspaceName, tableName);
Is there a way to tell Spark to start working before all the data has finished downloading?
Is there an option to tell spark-cassandra-connector to use more threads for the fetch?
thanks,
kokou.
If you look at the Spark UI, how many partitions is your table scan creating? I just did something like this and I found that Spark was creating too many partitions for the scan and it was taking much longer as a result. The way I decreased the time on my job was by setting the configuration parameter spark.cassandra.input.split.size_in_mb to a value higher than the default. In my case it took a 20 minute job down to about four minutes. There are also a couple more Cassandra read specific Spark variables that you can set found here.
These stackoverflow questions are what I referenced originally, I hope they help you out as well.
Iterate large Cassandra table in small chunks
Set number of tasks on Cassandra table scan
EDIT:
After some performance testing with various Spark configuration parameters, I found that Spark was creating far too many table partitions when I wasn't giving the Spark executors enough memory. In my case, upping the memory by a gigabyte was enough to render the input split size parameter unnecessary. If you can't give the executors more memory, you may still need to set spark.cassandra.input.split.size_in_mb higher as a workaround.
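To get a feel for how the split size relates to the number of scan partitions, here is a back-of-the-envelope sketch. It assumes the partition count is roughly the table size divided by the split size. The connector works from Cassandra's own size estimates, so real numbers will differ, and estimated_scan_partitions is a hypothetical helper. The 64 and 512 values are only example settings.

```python
import math


def estimated_scan_partitions(table_size_mb, split_size_mb):
    """Approximate the Spark partition count for a full table scan."""
    return max(1, math.ceil(table_size_mb / split_size_mb))


table_size_mb = 100 * 1024  # assumed 100 GB table, for illustration only
small_split = estimated_scan_partitions(table_size_mb, 64)
large_split = estimated_scan_partitions(table_size_mb, 512)
print(small_split, large_split)
```

Raising the split size by a factor of eight cuts the estimated partition count by the same factor, which is the effect described above.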