Increase request timeout for CQL from NiFi - cassandra

I am using QueryCassandra processor in NiFi to fetch data from Cassandra but my query is getting timedoutexception. I want to increase the request time out while running the CQL query from the processor. Is there a way to do that or I will have to write a custom processor?

Most probably you're getting an exception because you're performing query on non-partition key - in this case, the query is distributed to the all nodes, and requires to go through all available data, and this is very slow if you have big data set.
In Cassandra queries are fast only when you're performing them on (at least) partition key. If you need to search on non-partition column, then you need to re-model your tables to match your queries. I recommend to take DS220 course on DataStax Academy for better understanding how Cassandra works.

As #Alex ott said, it is not recommended to query on non partition key. If you still want to do so and increase the timeout for the query, just property Max Wait Time to whatever timeout you want.
EDIT:
tl;dr: Apache's timeout wrapper doesn't really let you use the timeout option.
Now that you mentioned that this is a DataStax exception and not java.util.concurrent.TimeoutException, I can tell you that I've looked into QueryCassandra processor's source code and it seems like Apache just wrapped the query function with a Future to achieve a timeout instead of using DataStax's built-in timeout option. This results in a default non-changeable timeout by the DataStax driver. It should be reported to Apache as a bug.

Related

Cassandra query language(CQL): Can we combine SELECT * FROM table_name; and SELECT count(*) FROM table_name;

I am trying to find out if we can combine the 'Select *' and 'Select count(*)' query together in CQL and if yes, then how about the performance, does it increase the execution performance of the query than running 2 separate queries on large dataset.
Short answer - no, you can't combine these two queries.
Longer answer - don't do that. On the big datasets your query most probably will timeout because Cassandra will need to go through all data on all machines. This will lead to increased load to coordinator node, and potentially crash it. If you need to fetch all data you need to use another approach:
use Spark Cassandra Connector that is heavily optimized for such tasks
if you want to offload data, or just count, you can use DSBulk utility for that
if you still need to do that from the code, you need to perform so-called token range scanning - technology that is used by Spark Cassandra Connector and DSBulk. You can either implement that yourself (here is an example for Java driver 3.x), or for Java use DSBulk's API - the jars are available via Maven (I don't have example for that)

Cassandra unpredictable failure depending on WHERE clause

I am attempting to execute a SELECT statement against a large Cassandra table (10m rows) with various WHERE clauses. I am issuing these from the Datastax DevCenter application. The columns I am using in the where clause have secondary indexes.
The where clause looks like WHERE fileid = 18000 or alternatively WHERE fileid < 18000. In this example, the second where clause results in the error Unable to execute CQL script on 'connection1': Cassandra failure during read query at consistency ONE (1 responses were required but only 0 replica responded, 1 failed)
I have no idea why it is failing in this unpredictable manner. Any ideas?
NOTE: I am aware that this is a terrible idea, and Cassandra is not meant to be used in this way. I am issuing these queries and timing them to prove to others how inefficient Cassandra is for our use case compared to other solutions.
Your query is probably failing because of a READ timeout (the timeout on waiting to read data). You could try updating the Cassandra.yaml with a larger read timeout time with read_request_timeout_in_ms: 200000 (for 200s) to give an output rather than an error. However, if you're trying to prove the inefficiency of Cassandra in your use case, this error seems like a pretty good way to do it.

Cassandra datastax OperationTimedOutException

I'm using a 3 nodes cassandra 3.0.14 deployed on 3 differents VM. I have a lots of data (billions) and I would like to make quick search among my Cassandra architecture.
I've made a lots of research on Cassandra but I'm still facing some issues that I cannot understand:
When I am using cqlsh I can make a query that analyzes all my database
SELECT DISTINCT val_1 FROM myTable; is working.
However I cannot make the same request using my java code and datastax driver. My script return:
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [/XX.XX.XX.XX:9042] Timed out waiting for server response
Some request are working using cqlsh but making a more specific request will lead to a request timeout:
OperationTimedOut: errors={'127.0.0.1': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.0.0.1
For example if I'm making this request:
SELECT val_1 FROM myTable where time>'2018-09-16 09:00:00'; will work
SELECT val_1 FROM myTable where time>'2018-09-16 09:00:00' and time<'2018-09-17 09:00:00'; will lead to time out
I changed my request_timeout_in_ms to 60s but I know it is not a good practice. I also increase my read_request_timeout_in_ms and range_request_timeout_in_ms but I still have the previous issues.
Would anyone have the same problems ?
-Nicolas
Try to adjust the client timeout in Java code, as follows:
//configure socket options
SocketOptions options = new SocketOptions();
options.setConnectTimeoutMillis(30000);
options.setReadTimeoutMillis(300000);
options.setTcpNoDelay(true);
//spin up a fresh connection (using the SocketOptions set up above)
cluster = Cluster.builder().addContactPoint(Configuration.getCassandraHost()).withPort(Configuration.getCassandraPort())
.withCredentials(Configuration.getCassandraUser(), Configuration.getCassandraPass()).withSocketOptions(options).build();
This happens because you're using Cassandra incorrect way. The range operations, distinct, etc. works best only if you have the partition key specified in your query. Otherwise, Cassandra will need to scan whole cluster trying to find the data that you need, and this will lead to timeout even on the medium-sized database. Don't use the ALLOW FILTERING to enforce execution of the queries.
In Cassandra, database structure is modeled around the queries that you want to execute. I recommend to take DS201 & DS220 courses from the DataStax Academy.

Cassandra vs Cassandra+Ignite

(Single Node Cluster)I've got a table having 2 columns, one is of 'text' type and the other is a 'blob'. I'm using Datastax's C++ driver to perform read/write requests in Cassandra.
The blob is storing a C++ structure.(Size: 7 KB).
Since I was getting lesser than desirable throughput when using Cassandra alone, I tried adding Ignite on top of Cassandra, in the hope that there will be significant improvement in the performance as now the data will be read from RAM instead of hard disks.
However, it turned out that after adding Ignite, the performance dropped even more(roughly around 50%!).
Read Throughput when using only Cassandra: 21000 rows/second.
Read Throughput with Cassandra + Ignite: 9000 rows/second.
Since, I am storing a C++ structure in Cassandra's Blob, the Ignite API uses serialization/de-serialization while writing/reading the data. Is this the reason, for the drop in the performance(consider the size of the structure i.e. 7K) or is this drop not at all expected and maybe something's wrong in the configuration?
Cassandra: 3.11.2
RHEL: 6.5
Configurations for Ignite are same as given here.
I got significant improvement in Ignite+Cassandra throughput when I used serialization in raw mode. Now the throughput has increased from 9000 rows/second to 23000 rows/second. But still, it's not significantly superior to Cassandra. I'm still hopeful to find some more tweaks which will improve this further.
I've added some more details about the configurations and client code on github.
Looks like you do one get per each key in this benchmark for Ignite and you didn't invoke loadCache before it. In this case, on each get, Ignite will go to Cassandra to get value from it and only after it will store it in the cache. So, I'd recommend invoking loadCache before benchmarking, or, at least, test gets on the same keys, to give an opportunity to Ignite to store keys in the cache. If you think you already have all the data in caches, please share code where you write data to Ignite too.
Also, you invoke "grid.GetCache" in each thread - it won't take a lot of time, but you definitely should avoid such things inside benchmark, when you already measure time.

Spark Cassandra connector - Range query on partition key

I'm evaluating spark-cassandra-connector and i'm struggling trying to get a range query on partition key to work.
According to the connector's documentation it seems that's possible to make server-side filtering on partition key using equality or IN operator, but unfortunately, my partition key is a timestamp, so I can not use it.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also I can assure that there is data to be returned since running the query on cqlsh (doing the appropriate conversion using 'token' function) DOES return data.
I'm using spark 1.1.0 with standalone mode. Cassandra is 2.1.2 and connector version is 'b1.1' branch. Cassandra driver is DataStax 'master' branch.
Cassandra cluster is overlaid on spark cluster with 3 servers with replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
i think the CassandraRDD error is telling that the query that you are trying to do is not allowed in Cassandra and you have to load all the table in a CassandraRDD and then make a spark filter operation over this CassandraRDD.
So your code (in scala) should something like this:
val cassRDD= sc.cassandraTable("keyspace name", "table name").filter(row=> row.getDate("timestamp")>=DateFormat('2013-01-01T00:00:00.000Z')&&row.getDate("timestamp") < DateFormat('2013-12-31T00:00:00.000Z'))
If you are interested in making this type of queries you might have to take a look to others Cassandra connectors, like the one developed by Stratio
You have several options to get the solution you are looking for.
The most powerful one would be to use Lucene indexes integrated with Cassandra by Stratio, which allows you to search by any indexed field in the server side. Your writing time will be increased but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project so you can take all the advantages of the Lucene indexes in Cassandra through it. I would recommend you to use Lucene indexes when you are executing a restricted query that retrieves a small-medium result set, if you are going to retrieve a big piece of your data set, you should use the third option underneath.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look for it using an IN operator. The problem is, as far as I know, you can't use the spark-cassandra-connector for that, you should use the direct Cassandra driver which is not integrated with Spark, or you can have a look at the deep-spark project where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
, but, as I said before, I don't know if it fits to your needs since you might not be able to truncate your data and group it by date/time.
The last option you have, but the less efficient, is to bring the full data set to your spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate on contacting us if you need any help.
I hope it helps!

Resources