All host(s) tried for query failed - com.datastax.driver.core.OperationTimedOutException - cassandra

While performing Cassandra operations (batch execution of insert and update operations on two tables) from a Spark job, I am getting an "All host(s) tried for query failed - com.datastax.driver.core.OperationTimedOutException" error.
Cluster information:
Cassandra 2.1.8.621 | DSE 4.7.1
spark-cassandra-connector-java_2.10 version - 1.2.0-rc1 | cassandra-driver-core version - 2.1.7
Spark 1.2.1 | Hadoop 2.7.1 => 3 nodes
Cassandra 2.1.8 => 5 nodes
Each node has 28 GB of memory and 24 cores.
While searching for a solution I came across some discussions which say you should not use BATCHES. Still, I would like to find the root cause of this error. Also, how and from where do I set/get "SocketOptions.setReadTimeout"? Per the standard guideline, this timeout limit must be greater than the Cassandra request timeout to avoid possible errors.
Are request_timeout_in_ms and SocketOptions.setReadTimeout the same thing? Can anyone help me with this?

While performing Cassandra operations (batch execution of insert and update operations on two tables) using a Spark job I am getting an "All host(s) tried for query failed - com.datastax.driver.core.OperationTimedOutException" error.
Directly from the docs:
Why are my write tasks timing out/failing?
The most common cause of this is that Spark is able to issue write requests much more quickly than Cassandra can handle them. This can lead to GC issues and build up of hints. If this is the case with your application, try lowering the number of concurrent writes and the current batch size using the following options.
spark.cassandra.output.batch.size.rows
spark.cassandra.output.concurrent.writes
or in versions of the Spark Cassandra Connector greater than or equal to 1.2.0 set
spark.cassandra.output.throughput_mb_per_sec
which will allow you to control the amount of data written to C* per Spark core per second.
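For example, with PySpark these connector settings can be passed as ordinary Spark configuration. A minimal sketch, assuming the property names quoted above; the values are only illustrative starting points and the contact point is hypothetical:

from pyspark import SparkConf, SparkContext

# Throttle the connector's writes so Cassandra is not overwhelmed.
# Values below are illustrative, not recommendations.
conf = (SparkConf()
        .setAppName("cassandra-write-throttling")
        .set("spark.cassandra.connection.host", "10.0.0.1")           # hypothetical contact point
        .set("spark.cassandra.output.batch.size.rows", "100")
        .set("spark.cassandra.output.concurrent.writes", "5")
        .set("spark.cassandra.output.throughput_mb_per_sec", "5"))    # connector >= 1.2.0

sc = SparkContext(conf=conf)

The same properties can also be passed on the command line via spark-submit --conf.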
you should not use BATCHES
This is not always true: the connector uses local, token-aware batches for faster reads and writes, but this is tricky to get right in a custom app. In many cases async queries are better or just as good.
setReadTimeout
This is a DataStax Java driver method. The connector takes care of this for you; there is no need to change it.
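That said, to answer the "where do I set it" part of the question: newer connector versions expose the driver's read timeout as a Spark property, so you normally would not call SocketOptions yourself. The property name below is an assumption based on the connector's reference documentation; verify it against the version you are running, and keep it above Cassandra's request_timeout_in_ms as noted in the question.

from pyspark import SparkConf, SparkContext

# Assumed property name for the connector's client-side read timeout (ms).
# It should be larger than Cassandra's server-side request_timeout_in_ms.
conf = SparkConf().set("spark.cassandra.read.timeout_ms", "120000")
sc = SparkContext(conf=conf)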

Related

How does spark copy data between cassandra tables?

Can anyone please explain the internal workings of Spark when reading data from one table and writing it to another in Cassandra?
Here is my use case:
I am ingesting data coming in from an IoT platform into Cassandra through a Kafka topic. I have a small Python script that parses each message from Kafka to get the table name it belongs to, prepares a query, and writes it to Cassandra using DataStax's cassandra-driver for Python. With that script I am able to ingest around 300,000 records per minute into Cassandra. However, my incoming data rate is 510,000 records per minute, so the Kafka consumer lag keeps increasing.
The Python script is already making concurrent calls to Cassandra. If I increase the number of Python executors, cassandra-driver starts failing because the Cassandra nodes become unavailable to it. I am assuming there is a limit on Cassandra calls per second that I am hitting there. Here is the error message that I get:
ERROR Operation failed: ('Unable to complete the operation against any hosts', {<Host: 10.128.1.3 datacenter1>: ConnectionException('Pool is shutdown',), <Host: 10.128.1.1 datacenter1>: ConnectionException('Pool is shutdown',)})"
Recently, I ran a pyspark job to copy data from a couple of columns in one table to another. The table had around 168 million records in it. Pyspark job completed in around 5 hours. So it processed over 550000 records per min.
Here is the pyspark code I am using:
# Read the source table from Cassandra and cache it
df = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table=sourcetable, keyspace=sourcekeyspace)\
    .load().cache()

df.createOrReplaceTempView("data")

# Project the columns to copy; `field` holds the name of the value column
query = ("select dev_id, datetime, DATE_FORMAT(datetime, 'yyyy-MM-dd') as day, " + field + " as value from data")
vgDF = spark.sql(query)
vgDF.show(50)

# Append the result to the target Cassandra table
vgDF.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table=newtable, keyspace=newkeyspace)\
    .save()
Versions:
Cassandra 3.9.
Spark 2.1.0.
Datastax's spark-cassandra-connector 2.0.1
Scala version 2.11
Cluster:
Spark setup with 3 workers and 1 master node.
The 3 worker nodes also host a Cassandra cluster (each Cassandra node is co-located with one Spark worker node).
Each worker was allowed 10 GB of RAM and 3 cores.
So I am wondering:
Does Spark read all the data from Cassandra first and then write it to the new table, or is there some kind of optimization in the Spark Cassandra connector that allows it to move data between Cassandra tables without reading all the records?
If I replace my Python script with a Spark Streaming job in which I parse the packet to get the table name for Cassandra, will that help me ingest data more quickly into Cassandra?
The Spark connector is optimized because it parallelizes processing and reads/inserts data on the nodes that own it. You may get better throughput by using the Cassandra Spark Connector, but this will require more resources.
As for your task: 300,000 inserts/minute is 5,000/second, which frankly is not a very big number. You can increase throughput by applying several optimizations:
Use asynchronous calls to submit requests. You only need to make sure that you don't submit more requests than can be handled by one connection (though you can also increase this limit - I'm not sure how to do it in Python, but please check the Java driver docs to get an idea). A sketch combining this with the next two points follows this list.
Use the correct consistency level (LOCAL_ONE should give you very good performance).
Use the correct load balancing policy.
Run several copies of your script in parallel, making sure that they are all in the same Kafka consumer group.
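A minimal sketch of the first three points with the Python cassandra-driver; the contact points, datacenter, keyspace, table, and columns below are hypothetical placeholders:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra.concurrent import execute_concurrent_with_args

# Token-aware, DC-local load balancing (datacenter name is a placeholder).
cluster = Cluster(
    contact_points=["10.128.1.1", "10.128.1.3"],
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="datacenter1")),
)
session = cluster.connect("my_keyspace")   # hypothetical keyspace

# Prepared statement with LOCAL_ONE consistency; table and columns are placeholders.
insert = session.prepare(
    "INSERT INTO sensor_data (dev_id, datetime, value) VALUES (?, ?, ?)")
insert.consistency_level = ConsistencyLevel.LOCAL_ONE

def write_rows(rows):
    # rows: list of (dev_id, datetime, value) tuples, e.g. parsed from Kafka.
    # execute_concurrent_with_args keeps up to `concurrency` requests in
    # flight at once using the driver's asynchronous API.
    results = execute_concurrent_with_args(session, insert, rows, concurrency=100)
    for ok, result_or_exc in results:
        if not ok:
            print("write failed:", result_or_exc)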

Spark JDBC fetchsize option

I currently have an application which is supposed to connect to different types of databases, run a specific query on that database using Spark's JDBC options and then write the resultant DataFrame to HDFS.
The performance was extremely bad for Oracle (I didn't check all of them). It turned out to be because of the fetchSize property, which defaults to 10 rows for Oracle. So I increased it to 1000 and the performance gain was quite visible. Then I changed it to 10000, but some of the tables started failing with an out-of-memory issue in the executor (6 executors, 4 GB memory each, 2 GB driver memory).
My questions are:
Is the data fetched by Spark's JDBC persisted in executor memory for each run? Is there any way to un-persist it while the job is running?
Where can I get more information about the fetchSize property? I'm guessing it won't be supported by all JDBC drivers.
Are there any other things related to JDBC that I need to take care of to avoid OOM errors?
Fetch size - it's just a value for the JDBC PreparedStatement.
You can see it in JDBCRDD.scala:
stmt.setFetchSize(options.fetchSize)
You can read more about JDBC FetchSize here
One thing you can also do to improve performance is to set all 4 partitioning parameters, which will parallelize the reading. See more here. Then your read can be split across many machines, so the memory usage on each of them may be smaller.
For details on which JDBC options are supported and how, you must search your driver's documentation - every driver may have its own behaviour.
To answer #y2k-shubham's follow-up question, "do I pass it inside the connectionProperties param?", per the current docs the answer is "Yes", but note the lower-cased 's'.
fetchsize - The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to a low fetch size (e.g. Oracle with 10 rows). This option applies only to reading.
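A minimal PySpark sketch combining fetchsize with the four partitioning options mentioned above; the JDBC URL, credentials, table, partition column, bounds, and output path are hypothetical placeholders:

# Partitioned JDBC read with a larger fetch size, then write to HDFS.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
      .option("dbtable", "SALES.ORDERS")
      .option("user", "spark_user")
      .option("password", "secret")
      .option("fetchsize", "1000")            # rows per round trip; note the lower-cased 's'
      .option("partitionColumn", "ORDER_ID")  # the next four options parallelize the read
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "24")
      .load())

df.write.mode("overwrite").parquet("hdfs:///data/orders")   # hypothetical HDFS path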

Spark Cassandra Performance Issue

I am a new learner of Spark and Cassandra and I am facing a major performance issue. I am streaming data from Kafka every 5 seconds in Spark, then performing analytics on the data in R using JRI, and finally saving the data to the respective Cassandra column family. The time taken (in milliseconds) to save the data to Cassandra increases very rapidly with the number of input requests [each request is 200 KB].
Spark code:
sessionData.foreachRDD(new Function<JavaRDD<NormalizedData>, Void>() {
    public Void call(JavaRDD<NormalizedData> rdd) {
        System.out.println("step-3 " + System.currentTimeMillis());
        // save the micro-batch to the normalized_data table
        javaFunctions(rdd).writerBuilder("keyspace", "normalized_data", mapToRow(NormalizedData.class)).saveToCassandra();
        System.out.println("step-4 " + System.currentTimeMillis());
        return null;
    }
});
I was able to improve performance by running Spark and Cassandra on the same server. The delay was because Spark and Cassandra were on different servers, although in the same AWS region; the network delay was the main cause, as it hurt data locality. Thanks.
You can refer to this blog for Spark-Cassandra connector tuning; it will give you an idea of the performance numbers you can expect. You can also try another open source product, SnappyData, a Spark-based database, which may give you very high performance for your use case.
I am also using the Cassandra and Spark combination for real-time analytics. The following are a few best practices:
Data locality - run the Cassandra daemon alongside the Worker node (Spark standalone), the Node Manager (YARN), or the Mesos worker (Mesos).
Increase the parallelism, i.e. create more partitions/tasks (see the sketch after this list).
Use Cassandra connection pooling to improve throughput.
In your case, you are using JRI to call R inside Java. This is slow and adds performance overhead, so use SparkR to integrate R with Spark instead of calling JRI directly.
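As an illustration of the parallelism point, with the DataFrame API in PySpark you can repartition a DataFrame df before writing it with the connector; the partition count, table, and keyspace below are hypothetical:

# More partitions means more concurrent write tasks against Cassandra.
(df.repartition(48)
   .write
   .format("org.apache.spark.sql.cassandra")
   .mode("append")
   .options(table="normalized_data", keyspace="my_keyspace")
   .save())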

spark datastax cassandra connector slow to read from heavy cassandra table

I am new to Spark and the Spark Cassandra Connector. We are trying Spark for the first time in our team and we are using the Spark Cassandra connector to connect to our Cassandra database.
I wrote a query which uses a heavy table of the database, and I saw that the Spark task didn't start until the query had fetched all the records from the table.
It is taking more than 3 hours just to fetch all the records from the database.
To get the data from the DB we use:
// build an RDD backed by the full Cassandra table
CassandraJavaUtil.javaFunctions(sparkContextManager.getJavaSparkContext(SOURCE).sc())
        .cassandraTable(keyspaceName, tableName);
Is there a way to tell Spark to start working even before all the data has finished downloading?
Is there an option to tell the spark-cassandra-connector to use more threads for the fetch?
thanks,
kokou.
If you look at the Spark UI, how many partitions is your table scan creating? I just did something like this and found that Spark was creating too many partitions for the scan, and it was taking much longer as a result. The way I decreased the time on my job was by setting the configuration parameter spark.cassandra.input.split.size_in_mb to a value higher than the default. In my case it took a 20-minute job down to about four minutes. There are also a couple more Cassandra-read-specific Spark variables you can set, found here.
These stackoverflow questions are what I referenced originally, I hope they help you out as well.
Iterate large Cassandra table in small chunks
Set number of tasks on Cassandra table scan
EDIT:
After doing some performance testing with regards to fiddling with some Spark configuration parameters, I found that Spark was creating far too many table partitions when I wasn't giving the Spark executors enough memory. In my case, upping the memory by a gigabyte was enough to render the input split size parameter unnecessary. If you can't give the executors more memory, you may still need to set spark.cassandra.input.split.size_in_mb higher as a form of workaround.
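If you do need to set it, the split size is passed like any other Spark property; a minimal PySpark sketch (the 512 MB value is only illustrative):

from pyspark import SparkConf, SparkContext

# Larger input splits -> fewer, bigger partitions for the Cassandra table scan.
conf = (SparkConf()
        .setAppName("cassandra-scan")
        .set("spark.cassandra.input.split.size_in_mb", "512"))
sc = SparkContext(conf=conf)

It can equally be supplied on the command line with spark-submit --conf spark.cassandra.input.split.size_in_mb=512.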

Cassandra bulk insert solution

I have a Java program that runs as a service; this program must insert 50k rows/s (each row has 25 columns) into a Cassandra cluster.
My cluster contains 3 nodes; each node has a 4-core CPU (Core i5, 2.4 GHz) and 4 GB of RAM.
I used the Hector API, multithreading, and bulk inserts, but the performance is lower than expected (about 25k rows/s).
Does anyone have another suggestion? Does Cassandra support an internal bulk insert (without using Thrift)?
Astyanax is a high-level Java client for Apache Cassandra, a highly available, column-oriented database.
Astyanax is currently in use at Netflix. Issues are generally fixed as quickly as possible and releases are done frequently.
https://github.com/Netflix/astyanax
I've had good luck creating sstables and loading them directly. There is an sstableloader tool included in the distribution, as well as a JMX interface. You can create the sstables using the SSTableSimpleUnsortedWriter class.
Details here.
The fastest way to bulk-insert data into Cassandra is sstableloader, a utility provided by Cassandra from 0.8 onwards. For that you have to create sstables first, which is possible with SSTableSimpleUnsortedWriter; more about this is described here.
Another fast way is Cassandra's BulkOutputFormat for Hadoop. With this you can write a Hadoop job to load data into Cassandra. See more on this: bulk load to Cassandra with Hadoop.

Resources