Cassandra: How to create a UDF performing queries with volatile tables like SQL?

I am looking for examples of UDFs in Cassandra. So far, the examples I have found do some computation and return the result. What I need is to perform multiple database queries: store the result of one query in an intermediate table or variable, then use that value to perform another query. I'm not sure how to do that. Can anyone help?
PS: I am using the Node.js driver.

It's not possible with Cassandra alone: a UDF cannot issue queries of its own. You can do it together with Spark and push the intermediate tables back to the database.
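A minimal sketch of that pattern with the Scala spark-cassandra-connector (the keyspace, table and column names below are hypothetical, and `sc` is assumed to be a SparkContext configured with spark.cassandra.connection.host); the Node.js driver would then simply read the resulting table:
import com.datastax.spark.connector._

// step 1: read the source table and compute an intermediate result
val totals = sc.cassandraTable("shop", "orders")
  .map(row => (row.getString("customer_id"), row.getDouble("amount")))
  .reduceByKey(_ + _)                    // the "intermediate variable" lives in Spark

// step 2: persist the intermediate result as a regular Cassandra table
// (shop.customer_totals is assumed to have been created beforehand)
totals.saveToCassandra("shop", "customer_totals", SomeColumns("customer_id", "total"))

// step 3: use the intermediate result in a second pass, e.g. join it back
val enriched = sc.cassandraTable("shop", "customers")
  .keyBy(row => row.getString("customer_id"))
  .join(totals)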

Related

"LIKE" and "ILIKE" queries in AWS Keyspaces

As there is no support for custom index in AWS Keyspaces what would be the best solution / pattern to be able to run LIKE or ILIKE queries on specific columns of a Cassandra Table?
In vanilla Cassandra, you can use a SASI (SSTable-Attached Secondary Index) to run LIKE queries, but we can't in AWS Keyspaces.
Feeding an OpenSearch service, or even a good old Postgres at the same time of updating Keyspaces seems a bit overkill to me.
Fetching all columns in-memory somewhere to do the query seems slow as well.
What would be the lightest infra / architecture to implement to provide a LIKE query support based on AWS Keyspaces as source of truth?
You can use a lexicographical range in your SELECT statement to narrow the query down the same way a LIKE would. If you need to narrow it further, you can do that filtering client-side. I would love to learn more about your use case so I can better assist you.
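A minimal sketch of that lexicographic trick with the DataStax Java driver from Scala (keyspace, table and column names are hypothetical, and the Keyspaces-specific SSL/SigV4 connection settings are omitted). To emulate name LIKE 'lap%', query the half-open range from 'lap' up to the next prefix 'laq':
import com.datastax.oss.driver.api.core.CqlSession

// hypothetical schema: products(category text, name text, PRIMARY KEY (category, name))
val session = CqlSession.builder().build()   // endpoint, SSL and SigV4 auth config omitted

// name >= 'lap' AND name < 'laq' matches every name starting with "lap"
val rs = session.execute(
  """SELECT name FROM shop.products
    |WHERE category = 'electronics'
    |  AND name >= 'lap' AND name < 'laq'""".stripMargin)

rs.forEach(row => println(row.getString("name")))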

What is your approach for querying Cassandra with Spark (in R or Python)?

I am working with about a TB of data stored in Cassandra and trying to query it using Spark and R (could be Python).
My preference for querying the data would be to expose the Cassandra table I'm querying as a Spark RDD (using sparklyr and the spark-cassandra-connector with spark-sql) and simply do an inner join on the column of interest (it is a partition key column). The company I'm working with says that this approach is a bad idea, as it will translate into an IN clause in CQL and thus cause a big slowdown.
Instead, I'm using their preferred method: write a closure that extracts the data for a single id in the partition key over a JDBC connection, then apply that closure 200k times, once for each id I'm interested in. I use spark_apply to apply that closure in parallel on each executor. I also set spark.executor.cores to 1 so I get a lot of parallelization.
I'm having a lot of trouble with this approach and am wondering what the best practice is. Is it true that Spark SQL does not account for the slowdown associated with pulling multiple ids from a partition key column (IN operator)?
A few points here:
Working with Spark SQL is not always the most performant option; the optimizer might not always do as good a job as a query you write yourself.
Check the logs carefully during your work and always check how your high-level queries are translated into CQL queries; in particular, make sure you avoid a full table scan if you can.
If you are joining on the partition key, you should look into leveraging the repartitionByCassandraReplica and joinWithCassandraTable methods (see the sketch after this list). Have a look at the official doc here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md and Tip 4 of this blog post: https://www.instaclustr.com/cassandra-connector-for-spark-5-tips-for-success/
Final note: it's quite common to have two Cassandra data centers when using Spark. The first one serves regular reads/writes, the second one is used for running Spark. It's a separation-of-concerns best practice (at the cost of an additional DC, of course).
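A minimal sketch of the joinWithCassandraTable approach in Scala (keyspace, table and key-column names are hypothetical); it issues one targeted query per partition key instead of a single giant IN clause:
import com.datastax.spark.connector._

case class Key(id: String)                      // field name must match the partition key column
val myIds = Seq("id-1", "id-2", "id-3")         // in practice, your ~200k partition-key values
val idsRDD = sc.parallelize(myIds.map(Key(_)))

val rows = idsRDD
  .repartitionByCassandraReplica("my_keyspace", "my_table")  // co-locate keys with their replicas
  .joinWithCassandraTable("my_keyspace", "my_table")         // yields RDD[(Key, CassandraRow)]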
Hope it helps!

Does registerTempTable cause the table to get cached?

I have a SQL query which is doing a group by on many fields. The table that it uses is also big (4 TB in size). I'm registering the table as a temp table. However, I don't know whether the table gets cached or not when I register it as a temp table. I also don't know whether it is more performant to convert my query into DataFrame functions (e.g. df.groupBy().agg()...) rather than having it as a SQL statement. Any help on that?
SQL is most likely going to be the fastest by far (see the Databricks blog).
Did you try to partition/repartition your dataframe as well to see whether it improves the performance?
Regarding registerTempTable: it only registers the table within the Spark session; it does not cache anything on its own. You can check this in the Spark UI's Storage tab.
val test = List((1,2,3),(4,5,6)).toDF("bla","blb","blc")
test.createOrReplaceTempView("test")   // only registers a view, nothing is cached
test.show()
The Storage tab in the Spark UI stays blank here.
vs
val test = List((1,2,3),(4,5,6)).toDF("bla","blb","blc")
test.cache()                           // createOrReplaceTempView returns Unit, so cache the DataFrame itself
test.createOrReplaceTempView("test")
test.show()                            // the action materializes the cache, and it shows up under Storage
By the way, registerTempTable is deprecated since Spark 2.0 and has been replaced by createOrReplaceTempView.
I have a SQL query which is doing a group by on many fields. The table that it uses is also big (4 TB in size). I'm registering the table as a temp table. However, I don't know whether the table gets cached or not when I register it as a temp table.
Neither registerTempTable nor createOrReplaceTempView caches the data in memory or on disk by itself; you have to call the cache() (or persist()) function for that.
I also don't know whether it is more performant to convert my query into DataFrame functions (e.g. df.groupBy().agg()...) rather than having it as a SQL statement. Any help on that?
Keep in mind that the terms in a SQL query ultimately call the same functions under the hood, so whether you use SQL query syntax or the functions available in code makes no difference; it is the same thing.
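A minimal sketch illustrating that point (spark is assumed to be an existing SparkSession and my_big_table a hypothetical registered table); both forms produce the same physical plan:
import org.apache.spark.sql.functions.sum

val df = spark.table("my_big_table")             // hypothetical source table
df.createOrReplaceTempView("t")

val viaSql = spark.sql("SELECT key, sum(value) AS total FROM t GROUP BY key")
val viaApi = df.groupBy("key").agg(sum("value").as("total"))

viaSql.explain()   // compare the physical plans:
viaApi.explain()   // both go through the same Catalyst optimizer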

How to migrate data between two tables in Cassandra properly

I have to change the schema of one of my tables in Cassandra. It cannot be done simply with the ALTER TABLE command, because there are changes to the primary key.
So the question is: How to do such a migration in the best way?
Using the COPY command in cqlsh is not an option here because the dump file can be really huge.
Can I solve this problem without writing a custom application?
As Guillaume has suggested in the comments, you can't do this directly in Cassandra; schema-altering operations are very limited here. You have to perform such a migration manually using one of the tools suggested there, or, if you have very large tables, you can leverage Spark.
Spark can efficiently read the data from your nodes, transform it locally and save it back to the database. Remember that such a migration requires reading the whole table, so it might take a while. It is probably the most performant solution, but it needs some bigger preparation: a Spark cluster setup.
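A minimal sketch of that Spark approach with the spark-cassandra-connector (keyspace and table names are hypothetical, and new_table, with the new primary key, is assumed to have been created already):
// read the old table as a DataFrame
val old = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "old_table"))
  .load()

// optionally reshape/rename columns here to match the new primary key,
// then append everything into the pre-created target table
old.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "new_table"))
  .mode("append")
  .save()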

Spark Cassandra connector - Range query on partition key

I'm evaluating the spark-cassandra-connector and I'm struggling to get a range query on a partition key to work.
According to the connector's documentation it seems that it's possible to do server-side filtering on the partition key using the equality or IN operator, but unfortunately my partition key is a timestamp, so I cannot use them.
So I tried using Spark SQL with the following query ('timestamp' is the partition key):
select * from datastore.data where timestamp >= '2013-01-01T00:00:00.000Z' and timestamp < '2013-12-31T00:00:00.000Z'
Although the job spawns 200 tasks, the query is not returning any data.
I can also assure you that there is data to be returned, since running the query in cqlsh (doing the appropriate conversion using the 'token' function) DOES return data.
I'm using spark 1.1.0 with standalone mode. Cassandra is 2.1.2 and connector version is 'b1.1' branch. Cassandra driver is DataStax 'master' branch.
Cassandra cluster is overlaid on spark cluster with 3 servers with replication factor of 1.
Here is the job's full log
Any clue anyone?
Update: When trying to do server-side filtering based on the partition key (using CassandraRDD.where method) I get the following exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Range predicates on partition key columns (here: timestamp) are not supported in where. Use filter instead.
But unfortunately I don't know what "filter" is...
I think the CassandraRDD error is telling you that the query you are trying to run is not allowed in Cassandra: you have to load the whole table into a CassandraRDD and then apply a Spark filter operation over that CassandraRDD.
So your code (in Scala) should look something like this:
// java.util.Date has no >= operator, so parse the bounds once and compare epoch millis
val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSX")
val start = fmt.parse("2013-01-01T00:00:00.000Z").getTime
val end = fmt.parse("2013-12-31T00:00:00.000Z").getTime

val cassRDD = sc.cassandraTable("keyspace name", "table name").filter { row =>
  val ts = row.getDate("timestamp").getTime
  ts >= start && ts < end
}
If you are interested in making this type of query, you might have to take a look at other Cassandra connectors, like the one developed by Stratio.
You have several options to get the solution you are looking for.
The most powerful one would be to use the Lucene indexes integrated with Cassandra by Stratio, which allow you to search by any indexed field on the server side. Your write time will increase but, on the other hand, you will be able to query any time range. You can find further information about Lucene indexes in Cassandra here. This extended version of Cassandra is fully integrated into the deep-spark project, so you can take advantage of the Lucene indexes in Cassandra through it. I would recommend using Lucene indexes when you are executing a restricted query that retrieves a small to medium result set; if you are going to retrieve a big piece of your data set, you should use the third option below.
Another approach, depending on how your application works, might be to truncate your timestamp field so you can look it up using an IN operator. The problem is that, as far as I know, you can't use the spark-cassandra-connector for that; you would have to use the plain Cassandra driver, which is not integrated with Spark, or have a look at the deep-spark project, where a new feature allowing this is about to be released very soon. Your query would look something like this:
select * from datastore.data where timestamp IN ('2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', ... , '2013-12-31')
but, as I said before, I don't know whether it fits your needs, since you might not be able to truncate your data and group it by date/time.
The last option you have, but the least efficient one, is to bring the full data set to your Spark cluster and apply the filter on the RDD.
Disclaimer: I work for Stratio :-) Don't hesitate to contact us if you need any help.
I hope it helps!
