Using Cassandra's ttl() in a WHERE clause

I'd like to ask if it's possible to get rows from Cassandra that have a TTL (time to live) greater than 0, so that in the next step I can update those rows with a TTL of 0. The goal is basically to change the TTL of all the columns for every entry in the database to 0.
I've tried SELECT * FROM table WHERE ttl(column1) > 0, but it seems it's not possible to use the ttl() function in a WHERE clause.
I also found an approach where we export all the rows to CSV, delete the data in our table, and import it again from the CSV with a new TTL. That works, but it's risky because we have over a million entries in production and we don't know how it will behave.

You can't do this with CQL alone - you need support from some external tool, for example:
DSBulk - you can unload all your data into a CSV file and load it back with a new TTL set (if you want TTL 0, just load the data back without a TTL). There is a blog post that shows how to use DSBulk with TTL. But you can't put a condition on the TTL, which is why you need to unload all of your data.
Spark with the Spark Cassandra Connector (even in local master mode). Version 2.5.0 supports TTL in the DataFrame API (earlier versions supported it only in the RDD API); for Spark 2.4 you need to register the functions correctly. This can be done once, directly in the spark-shell, with something like the following (adjust the columns in the select and filter statements):
import org.apache.spark.sql.cassandra._   // cassandraFormat and the ttl() column function
import spark.implicits._                  // enables the $"column" syntax in spark-shell

val data = spark.read.cassandraFormat("table", "keyspace").load()
// keep only rows whose column still has a TTL, then write them back without a TTL
val ttlData = data.select(ttl("col1").as("col_ttl"), $"col2", $"col3").filter($"col_ttl" > 0)
ttlData.drop("col_ttl").write.cassandraFormat("table", "keyspace").mode("append").save()

Related

Cassandra 3.7 CDC / incremental data load

I'm very new to the ETL world and I wish to implement incremental data loading with Cassandra 3.7 and Spark. I'm aware that later versions of Cassandra do support CDC, but I can only use Cassandra 3.7. Is there a method through which I can track only the changed records and use Spark to load them, thereby performing incremental data loading?
If it can't be done on the Cassandra end, any other suggestions are also welcome on the Spark side :)
It's quite a broad topic, and an efficient solution will depend on the amount of data in your tables, the table structure, how data is inserted/updated, etc. The specific solution may also depend on the version of Spark available. One downside of a Spark-only method is that you can't easily detect deletes without keeping a complete copy of the previous state, so that you can generate a diff between the two states.
In all cases you'll need to perform a full table scan to find changed entries, but if your table is organized specifically for this task, you can avoid reading all of the data. For example, if you have a table with the following structure:
create table test.tbl (
  pk int,
  ts timestamp,
  v1 ...,
  v2 ...,
  primary key(pk, ts));
then if you run the following query:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("tbl", "test").load()
val filtered = data.filter("""ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")
then the Spark Cassandra Connector will push this query down to Cassandra and will read only the data where ts is in the given time range. You can check this by executing filtered.explain (shown below) and verifying that both time filters are marked with the * symbol.
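To double-check the pushdown (nothing is assumed here beyond the filtered DataFrame above):
// the two ts predicates should appear among the Cassandra scan's pushed filters,
// which the connector marks with a * in the plan output
filtered.explain()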
Another way to detect changes is to retrieve the write time from Cassandra and filter the changes based on that information. Fetching the writetime is supported in the RDD API for all recent versions of the SCC, and in the DataFrame API since SCC 2.5.0 (which requires at least Spark 2.4, although it may work with 2.3 as well); a sketch follows after this list. After fetching this information you can apply filters on the data and extract the changes. But you need to keep several things in mind:
there is no way to detect deletes using this method
write time information exists only for regular and static columns, not for primary key columns
each column may have its own write time value if there was a partial update of the row after insertion
in most versions of Cassandra, calling the writetime function on a collection column (list/map/set) generates an error, and it may return null for a column with a user-defined type
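As a rough illustration of the DataFrame variant - a sketch only, which assumes SCC 2.5.0+, the writeTime helper that ships alongside ttl in org.apache.spark.sql.cassandra (check the exact function name against your connector version), the test.tbl table from above, and a hypothetical checkpoint value:
import org.apache.spark.sql.cassandra._   // cassandraFormat, ttl(...), writeTime(...)
import spark.implicits._

val data = spark.read.cassandraFormat("tbl", "test").load()

// WRITETIME is reported in microseconds since the epoch; lastRunMicros stands in
// for a checkpoint saved by the previous incremental run (hypothetical value)
val lastRunMicros = 1583854894373000L
val changed = data
  .select($"pk", $"ts", $"v1", $"v2", writeTime("v1").as("v1_wt"))   // v1 must be a regular column
  .filter($"v1_wt" > lastRunMicros)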
P.S. Even if you had CDC enabled, it's not a trivial task to use it correctly:
you need to de-duplicate changes - you get RF copies of each change
some changes could be lost, for example when a node was down, and then propagated later via hints or repairs
TTL isn't easy to handle
...
For CDC, you may look for presentations from the 2019 DataStax Accelerate conference - there were several talks on that topic.

Incremental and parallel reads from an RDBMS in Spark using JDBC

I'm working on a project that involves reading data from an RDBMS using JDBC, and I succeeded in reading the data. This is something I will be doing fairly regularly, on a weekly basis, so I've been trying to come up with a way to ensure that after the initial read, subsequent ones only pull updated records instead of pulling the entire table.
I can do this with Sqoop incremental import by specifying three parameters (--check-column, --incremental last-modified/append and --last-value). However, I don't want to use Sqoop for this. Is there a way I can replicate the same in Spark with Scala?
Secondly, some of the tables do not have a unique column which can be used as partitionColumn, so I thought of using a row_number function to add a unique column to these tables and then get the MIN and MAX of that column as lowerBound and upperBound respectively. My challenge now is how to dynamically pass these values into the read statement like the one below:
val queryNum = "select a1.*, row_number() over (order by sales) as row_nums from (select * from schema.table) a1"

val df = spark.read.format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("partitionColumn", row_nums)
  .option("lowerBound", min(row_nums))
  .option("upperBound", max(row_nums))
  .option("numPartitions", some value)
  .option("fetchsize", some value)
  .option("dbtable", queryNum)
  .option("user", user)
  .option("password", password)
  .load()
I know the above code is not right and might be missing a whole lot, but I guess it gives a general overview of what I'm trying to achieve here.
It's surprisingly complicated to handle incremental JDBC reads in Spark. IMHO, it severely limits the ease of building many applications and may not be worth your trouble if Sqoop is doing the job.
However, it is doable. See this thread for an example using the dbtable option:
Apache Spark selects all rows
To keep this job idempotent, you'll need to read the max row of your prior output, either directly by loading all the data files or via a log file that you write out each time. If your data files are massive you may need to use the log file; if they are smaller you could potentially load them all.
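As a rough sketch of the dynamic-bounds idea from the question - not a definitive implementation: the subquery/alias syntax depends on your database, and the queries below are assumptions you'd replace with your own - you can first run a small JDBC query that returns the MIN and MAX of the generated row_number column, then feed those values into the partitioned read:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-incremental").getOrCreate()

// both queries are passed as "dbtable", so they must be parenthesized subqueries with an alias
val baseQuery =
  "(select a1.*, row_number() over (order by sales) as row_nums from schema.table a1) as q"
val boundsQuery =
  "(select min(row_nums) as lo, max(row_nums) as hi from " +
  "(select row_number() over (order by sales) as row_nums from schema.table) b) as bounds"

// driver, url, user and password are assumed to be defined elsewhere, as in the question
def jdbcReader = spark.read.format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("user", user)
  .option("password", password)

// 1) fetch the bounds with a single-row query (positional access avoids column-case issues)
val bounds = jdbcReader.option("dbtable", boundsQuery).load().head()
val lo = bounds.getAs[Number](0).longValue
val hi = bounds.getAs[Number](1).longValue

// 2) use them for the partitioned read
val df = jdbcReader
  .option("dbtable", baseQuery)
  .option("partitionColumn", "row_nums")
  .option("lowerBound", lo.toString)
  .option("upperBound", hi.toString)
  .option("numPartitions", "8")        // tune for your cluster and the source database
  .option("fetchsize", "10000")
  .load()
Note that generating row_number over the whole table on every run is itself a full scan on the source database, so this mainly helps with parallelism rather than true incremental loading.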

Mass insert on Cassandra from Spark with different TTL

I want to insert a huge volume of data from Spark into Cassandra. The data has a timestamp column which determines the TTL, but the TTL differs for each row. My question is: how can I handle the TTL while inserting data in bulk from Spark?
My current implementation:
raw_data_final.write.format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Overwrite)
  .options(Map("table" -> offerTable,
               "keyspace" -> keySpace,
               "spark.cassandra.output.ttl" -> ttl_seconds))
  .save()
Here raw_data_final has around a million records, with each record yielding a different TTL. So, is there a way to do a bulk insert and somehow specify the TTL from a column within raw_data?
Thanks.
This is supported by setting the WriteConf parameter with the TTLOption.perRow option. The official documentation has the following example for RDDs:
import com.datastax.spark.connector.writer._
...
rdd.saveToCassandra("test", "tab", writeConf = WriteConf(ttl = TTLOption.perRow("ttl")))
In your case you need to replace "ttl" with the name of your column with TTL.
I'm not sure that you can set this directly on a DataFrame, but you can always get an RDD from the DataFrame and use saveToCassandra with WriteConf, as sketched below.
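A minimal sketch of that workaround, assuming hypothetical columns id and value in raw_data_final that match the target table, plus a ttl_seconds column carrying the per-row TTL - adjust names and types to your schema:
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer._

// case class mirroring the table's columns, plus an extra field carrying the per-row TTL;
// the field name "ttl" is what TTLOption.perRow("ttl") refers to
case class OfferWithTtl(id: Int, value: String, ttl: Int)

val rddWithTtl = raw_data_final.rdd.map { row =>
  OfferWithTtl(row.getAs[Int]("id"), row.getAs[String]("value"), row.getAs[Int]("ttl_seconds"))
}

rddWithTtl.saveToCassandra(keySpace, offerTable,
  writeConf = WriteConf(ttl = TTLOption.perRow("ttl")))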
Update, September 2020: support for writetime and TTL in DataFrames was added in Spark Cassandra Connector 2.5.0.

PrestoDB v0.125 SELECT only returns subset of Cassandra records

SELECT statements in PrestoDB v0.125 with a Cassandra connector to a DataStax Cassandra cluster only return 200 rows, even where the table contains many more rows than that. Aggregate queries like SELECT COUNT(*) over the same table also return a result of just 200.
(This behaviour is identical when querying with the PyHive connector and with the base Presto CLI.)
The documentation isn't much help, but I am guessing that the issue is pagination and a need to set environment variables (which the documentation doesn't explain):
https://prestodb.io/docs/current/installation/cli.html
Does anyone know how I can remove this limit of 200 rows returned? What specific environment variable setting do I need?
For those who come after - the solution is in the cassandra.properties connector configuration for Presto. The key setting is:
cassandra.limit-for-partition-key-select
This needs to be set higher than the total number of rows in the table you are querying; otherwise SELECT queries will return only a fraction of the stored data (not having located all of the partition keys).
Complete copy of my config file (which may help!):
connector.name=cassandra
# Comma separated list of contact points
cassandra.contact-points=host1,host2
# Port running the native Cassandra protocol
cassandra.native-protocol-port=9042
# Limit of rows to read for finding all partition keys.
cassandra.limit-for-partition-key-select=2000000000
# maximum number of schema cache refresh threads, i.e. maximum number of parallel requests
cassandra.max-schema-refresh-threads=10
# schema cache time to live
cassandra.schema-cache-ttl=1h
# schema refresh interval
cassandra.schema-refresh-interval=2m
# Consistency level used for Cassandra queries (ONE, TWO, QUORUM, ...)
cassandra.consistency-level=ONE
# fetch size used for Cassandra queries
cassandra.fetch-size=5000
# fetch size used for partition key select query
cassandra.fetch-size-for-partition-key-select=20000

Spark Cassandra connector - where clause

I am trying to do some analytics on time series data stored in Cassandra by using Spark and the new connector published by DataStax.
In my schema the partition key is the meter ID, and I want to run Spark operations only on specific series, therefore I need to filter by meter ID.
I would then like to run a query like: Select * from timeseries where series_id = X
I have tried to achieve this by doing:
JavaRDD<CassandraRow> rdd = sc.cassandraTable("test", "timeseries").select(columns).where("series_id = ?",ids).toJavaRDD();
When executing this code the resulting query is:
SELECT "series_id", "timestamp", "value" FROM "timeseries" WHERE token("series_id") > 1059678427073559546 AND token("series_id") <= 1337476147328479245 AND series_id = ? ALLOW FILTERING
A clause on my partition key (token("series_id") > X AND token("series_id") <= Y) is added automatically, and then mine is appended after that. This obviously does not work, and I get an error saying: "series_id cannot be restricted by more than one relation if it includes an Equal".
Is there a way to get rid of the clause added automatically? Am I missing something?
Thanks in advance
The driver automatically determines the partition key using table metadata it fetches from the cluster itself. It then uses this to append the token ranges to your CQL so that it can read a chunk of data from the specific node it's trying to query. In other words, Cassandra thinks series_id is your partition key and not meter_id. If you run a describe command on your table, I bet you'll be surprised.
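One common workaround - a sketch only, in Scala for brevity (the Java API has an equivalent joinWithCassandraTable), and it assumes series_id really is the sole partition key and that you know the concrete ids you want - is to skip the WHERE clause entirely and join an RDD of keys against the table:
import com.datastax.spark.connector._

// hypothetical list of the series you want to read; replace with your own values and key type
val wantedIds = Seq(1L, 2L)
val rows = sc.parallelize(wantedIds.map(Tuple1(_)))
  .joinWithCassandraTable("test", "timeseries")   // joins on the partition key (series_id)
  .select("series_id", "timestamp", "value")
  .values                                          // keep just the CassandraRow results
Each key results in a direct query against its own partition, so there is no conflict with the automatically added token-range predicates.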
