Fetching millions of device records from Cassandra using the Go driver

I am using a Cassandra database to fetch data for millions of devices, keyed on device_id as the partition key, in parallel from an API using the Go driver.
Could someone please guide me on which driver parameters need to be set, and to which values, on the Go driver?

Have you tried looking into this Go code example (using the gocql driver) that demonstrates how to fetch millions of rows in parallel from Scylla?
It's basically a full table scan (or a large range scan) that showcases Scylla's capacity for high parallelism.
Read more in this post:
https://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/
The actual code example:
https://github.com/scylladb/scylla-code-samples/tree/master/efficient_full_table_scan_example_code
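To make that concrete for the original question (which gocql parameters to set), here is a minimal sketch of a parallel per-partition fetch. The hosts, keyspace, table, and column names are placeholders, and the values shown are starting points to benchmark and tune, not recommendations:

    package main

    import (
        "log"
        "sync"
        "time"

        "github.com/gocql/gocql"
    )

    func main() {
        // Cluster-level knobs that matter most for high-throughput parallel reads.
        cluster := gocql.NewCluster("10.0.0.1", "10.0.0.2", "10.0.0.3") // placeholder hosts
        cluster.Keyspace = "devices_ks"         // placeholder keyspace
        cluster.Consistency = gocql.LocalQuorum // pick per your durability needs
        cluster.NumConns = 4                    // connections per host
        cluster.Timeout = 5 * time.Second       // per-request timeout
        cluster.PageSize = 5000                 // rows per fetched page
        // Token-aware routing sends each request directly to a replica that owns the partition.
        cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(gocql.RoundRobinHostPolicy())

        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // A fixed worker pool bounds the number of in-flight requests.
        deviceIDs := make(chan string)
        var wg sync.WaitGroup
        for w := 0; w < 64; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for id := range deviceIDs {
                    var name string
                    // device_id is the partition key, so each query is a single-partition read.
                    iter := session.Query(
                        `SELECT name FROM device_info WHERE device_id = ?`, id, // placeholder table/columns
                    ).Iter()
                    for iter.Scan(&name) {
                        // process one row here
                    }
                    if err := iter.Close(); err != nil {
                        log.Printf("device %s: %v", id, err)
                    }
                }
            }()
        }

        for _, id := range []string{"dev-1", "dev-2"} { // placeholder IDs; feed these from your API
            deviceIDs <- id
        }
        close(deviceIDs)
        wg.Wait()
    }

NumConns and the worker count are the main throughput levers; token-aware routing avoids an extra coordinator hop on each single-partition read.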

Related

Increase request timeout for CQL from NiFi

I am using the QueryCassandra processor in NiFi to fetch data from Cassandra, but my query is failing with a timeout exception. I want to increase the request timeout when running the CQL query from the processor. Is there a way to do that, or will I have to write a custom processor?
Most probably you're getting the exception because you're querying on a non-partition-key column. In that case the query is distributed to all nodes and has to go through all available data, which is very slow on a big data set.
In Cassandra, queries are fast only when you run them against (at least) the partition key. If you need to search on a non-partition column, you need to re-model your tables to match your queries. I recommend taking the DS220 course on DataStax Academy for a better understanding of how Cassandra works.
As @Alex Ott said, it is not recommended to query on a non-partition key. If you still want to do so and increase the timeout for the query, just set the Max Wait Time property to whatever timeout you want.
EDIT:
tl;dr: Apache's timeout wrapper doesn't really let you use the timeout option.
Now that you mentioned that this is a DataStax exception and not a java.util.concurrent.TimeoutException, I can tell you that I've looked into the QueryCassandra processor's source code, and it seems Apache just wrapped the query function in a Future to achieve a timeout instead of using the DataStax driver's built-in timeout option. This leaves the DataStax driver's default timeout in place with no way to change it. It should be reported to Apache as a bug.
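To illustrate the distinction outside of NiFi: gocql (the Go driver from the first question on this page) exposes both layers explicitly. A hypothetical sketch with placeholder host, keyspace, and query; the caller-level context timeout plays the role of Apache's Future wrapper, while the driver-level timeout is the option the processor never exposes:

    package main

    import (
        "context"
        "log"
        "time"

        "github.com/gocql/gocql"
    )

    func main() {
        cluster := gocql.NewCluster("10.0.0.1") // placeholder host
        cluster.Keyspace = "my_ks"              // placeholder keyspace
        // Driver-level timeout: the setting the wrapper never touches,
        // so the driver's default always applies underneath.
        cluster.Timeout = 30 * time.Second

        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // Caller-level timeout: cancels the call from the outside, like the
        // Future wrapper, but the driver timeout above still governs the request.
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        var n int
        if err := session.Query(`SELECT count(*) FROM my_table`). // placeholder query
            WithContext(ctx).Scan(&n); err != nil {
            log.Printf("query failed or timed out: %v", err)
        }
    }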

Is there a way to check counts in Cassandra system tables? Where can we check the metadata of the latest inserts?

I am working on an Oracle-to-Cassandra migration tool, where I want to maintain a validation table with an Oracle count column and a Cassandra count column so that I can validate the migration job. Does Cassandra maintain the count of recently executed/inserted queries anywhere internally, or the total row count of a particular table? Is this stored in any system table from which we can read the counts instead of executing a count(*) query on the tables, and if so, which one? If not, please suggest a way to design a validation framework for the data migration.
Cassandra is a distributed system, and there is no place where it collects counts per table. You can get some estimates from system.size_estimates, but it only reports the partition count per token range, and their sizes.
For the framework you're asking about, you may need to develop custom Spark code (the easiest way) that performs the row counting and other checks. Spark is highly optimized for efficient data access and may be preferable to writing custom code of your own.
Also, during migration, consider using a consistency level greater than ONE to make sure that at least several nodes have confirmed the write. That said, it depends on the amount of data and the timing requirements of your migration jobs.
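If you'd rather avoid Spark, one alternative is to reuse the parallel token-range scan idea from the first answer on this page for validation: count each slice of the token ring in parallel, sum the results, and compare against the Oracle count. A rough gocql sketch with placeholder host, keyspace, and table names (count(*) is still expensive, it's just bounded per slice):

    package main

    import (
        "fmt"
        "log"
        "sync"
        "sync/atomic"

        "github.com/gocql/gocql"
    )

    // tok maps an unsigned ring position to a signed Murmur3 token, preserving order.
    func tok(u uint64) int64 { return int64(u ^ (1 << 63)) }

    func main() {
        cluster := gocql.NewCluster("10.0.0.1") // placeholder host
        cluster.Keyspace = "migration_ks"       // placeholder keyspace
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        const slices = 256            // 256 divides the 2^64 token ring evenly
        step := uint64(1) << (64 - 8) // width of each slice

        jobs := make(chan uint64)
        var total int64
        var wg sync.WaitGroup
        for w := 0; w < 16; w++ { // parallel counting workers
            wg.Add(1)
            go func() {
                defer wg.Done()
                for i := range jobs {
                    start, end := i*step, i*step+step-1
                    var n int64
                    // placeholder table/key names; count one slice of the token ring
                    if err := session.Query(
                        `SELECT count(*) FROM device_info WHERE token(device_id) >= ? AND token(device_id) <= ?`,
                        tok(start), tok(end),
                    ).Scan(&n); err != nil {
                        // a failed slice means the total is unreliable; treat it as a validation failure
                        log.Printf("slice %d: %v", i, err)
                        continue
                    }
                    atomic.AddInt64(&total, n)
                }
            }()
        }
        for i := uint64(0); i < slices; i++ {
            jobs <- i
        }
        close(jobs)
        wg.Wait()
        fmt.Println("cassandra row count:", total) // store next to the Oracle count for validation
    }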

How to run an incremental query

I am using Cassandra 3.10 and the DataStax driver 3.1.4.
I would like to be able to run a query that returns data in sets of, say, 10,000 records until the full dataset has been processed; the aim is to be memory efficient.
You can page the data in most drivers. For your query you specify a fetch size; when you get to the last fetched row in your result set, the driver will automatically fetch the next fetch-size number of rows.
Everything you need to know about the datastax java driver pager is well documented here: https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/
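That link covers the Java driver. If you're on Go with gocql instead (as in the first question on this page), the equivalent knob is a page size on the query; the iterator transparently requests the next page as you scan past the current one. A minimal sketch with placeholder names:

    package main

    import (
        "fmt"
        "log"

        "github.com/gocql/gocql"
    )

    func main() {
        cluster := gocql.NewCluster("10.0.0.1") // placeholder host
        cluster.Keyspace = "my_ks"              // placeholder keyspace
        session, err := cluster.CreateSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        // Only ~10k rows are resident at a time; scanning past the last row of a
        // page makes the iterator request the next page from the server.
        iter := session.Query(`SELECT id, payload FROM my_table`). // placeholder query
            PageSize(10000).
            Iter()

        var id gocql.UUID
        var payload string
        for iter.Scan(&id, &payload) {
            fmt.Println(id, payload) // process one row at a time
        }
        if err := iter.Close(); err != nil {
            log.Fatal(err)
        }
    }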
If you know in advance the size of your dataset (e.g. 10k records), the best you can do is design your tables around this dataset size, e.g. create a table and organize your data a priori into partitions of 10k records each.
This basically aims at following the rule "model around your queries".

Use Cases for Spark

We have an application which clients use to track their procurement cycle. We need to build a solution that helps users pull any column from any table in a particular subject area, and they should be able to see all the rows of the result of joining the tables from which the columns were pulled. It needs to be similar to a Salesforce-style reporting solution. We are looking at HDFS and Spark in Azure HDInsight to support these querying capabilities. We would like to know if this is a valid use case for Spark. The volume of the join of all tables can easily touch 500 million rows, which would be pulled into the Spark driver's memory before being displayed to the user.
Please let me know if this is something that can be done using Spark.
As per my understanding, Spark is mostly used for batch processing. If your use case is directly user-facing, then I am doubtful about using Spark, because there may be better solutions (or alternative architectures). Joining 500 million rows in real time sounds crazy!
The volume of the joins of all tables can easily touch 500 million rows which will be pulled into the Spark driver memory before being displayed to the user.
This is another thing that puzzled me. Pulling all 500 million rows into the RAM of a single Java process doesn't sound right, for obvious reasons.
Updated
Just using Spark to process huge data will not be effective for real-time solutions like your use case. But Spark will be very effective if you pre-process your data, cache the results in some other system, and prepare views from those results that can be served to your users. This is more or less similar to the Lambda Architecture:
Spark on a YARN cluster to periodically process the data and generate/update the different views, a distributed storage system (preferably a columnar one) to cache the views, and a REST API to serve the views to users.
Late reply to the question, but in case someone else is reading this in the future: AWS Redshift does exactly this.

Spark Cassandra connector - Range query on partition key

I'm evaluating the spark-cassandra-connector, and I'm struggling to get a range query on a partition key to work.
According to the connector's documentation it seems it's possible to do server-side filtering on the partition key using the equality or IN operators, but unfortunately my partition key is a timestamp, so I cannot use them.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
Also, I can assure you that there is data to be returned, since running the query in cqlsh (doing the appropriate conversion using the token function) DOES return data.
I'm using Spark 1.1.0 in standalone mode. Cassandra is 2.1.2 and the connector version is the 'b1.1' branch. The Cassandra driver is the DataStax 'master' branch.
The Cassandra cluster is overlaid on the Spark cluster of 3 servers, with a replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: when trying to do server-side filtering based on the partition key (using the CassandraRDD.where method), I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra, and that you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over that CassandraRDD.
So your code (in Scala) should look something like this:

    val fmt  = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
    val from = fmt.parse("2013-01-01T00:00:00.000Z")
    val to   = fmt.parse("2013-12-31T00:00:00.000Z")
    val cassRDD = sc.cassandraTable("keyspace_name", "table_name")
      .filter(row => !row.getDate("timestamp").before(from) && row.getDate("timestamp").before(to))
If you are interested in making this type of query, you might want to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated into Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will be increased but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small-to-medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using the IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the direct Cassandra driver, which is not integrated with Spark, or have a look at the deep-spark project, where a new feature allowing this is about to be released. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
but, as I said before, I don't know whether it fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option you have, but the least efficient one, is to bring the full data set to your Spark cluster and apply a filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!
