Mass insert on Cassandra from Spark with different TTL - apache-spark

I want to insert a huge volume of data from Spark into Cassandra. The data has a timestamp column which determines the TTL, but this differs for each row. My question is: how can I handle the TTL while inserting data in bulk from Spark?
My current implementation -
raw_data_final.write.format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Overwrite)
  .options(Map("table" -> offerTable,
    "keyspace" -> keySpace, "spark.cassandra.output.ttl" -> ttl_seconds))
  .save
Here raw_data_final has around a million records, with each record yielding a different TTL. So, is there a way to do a bulk insert and somehow specify the TTL from a column within raw_data_final?
Thanks.

This is supported by setting the WriteConf parameter with the TTLOption.perRow option. The official documentation has the following example for RDDs:
import com.datastax.spark.connector.writer._
...
rdd.saveToCassandra("test", "tab", writeConf = WriteConf(ttl = TTLOption.perRow("ttl")))
In your case, you need to replace "ttl" with the name of your column that holds the TTL.
I'm not sure that you can set this directly on a DataFrame, but you can always get an RDD from the DataFrame and use saveToCassandra with WriteConf.
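For illustration, here is a minimal sketch of that conversion, assuming hypothetical column names offer_id, payload and ttl_seconds - rename the case class fields to match your table. The extra ttl_seconds field is consumed by TTLOption.perRow and is not written as a regular column:
import com.datastax.spark.connector._
import com.datastax.spark.connector.writer._

// Hypothetical row shape: fields must match your table's columns,
// plus one extra field holding the per-row TTL in seconds.
case class OfferRow(offer_id: String, payload: String, ttl_seconds: Int)

val rdd = raw_data_final.rdd.map(r =>
  OfferRow(r.getAs[String]("offer_id"), r.getAs[String]("payload"), r.getAs[Int]("ttl_seconds")))

rdd.saveToCassandra(keySpace, offerTable,
  writeConf = WriteConf(ttl = TTLOption.perRow("ttl_seconds")))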
Update, September 2020: support for writetime and ttl in DataFrames was added in Spark Cassandra Connector 2.5.0.

Related

Cassandra 3.7 CDC / incremental data load

I'm very new to the ETL world and I wish to implement incremental data loading with Cassandra 3.7 and Spark. I'm aware that later versions of Cassandra do support CDC, but I can only use Cassandra 3.7. Is there a method through which I can track only the changed records and use Spark to load them, thereby performing incremental data loading?
If it can't be done on the Cassandra end, any other suggestions are also welcome on the Spark side :)
It's quite a broad topic, and an efficient solution will depend on the amount of data in your tables, the table structure, how data is inserted/updated, etc. The specific solution may also depend on the version of Spark available. One downside of a Spark-only approach is that you can't easily detect deletions without keeping a complete copy of the previous state, so that you can generate a diff between the two states.
In all cases you'll need to perform a full table scan to find changed entries, but if your table is organized specifically for this task, you can avoid reading all the data. For example, if you have a table with the following structure:
create table test.tbl (
  pk int,
  ts timestamp,
  v1 ...,
  v2 ...,
  primary key(pk, ts));
then if you run the following query:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("tbl", "test").load()
val filtered = data.filter("""ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")
then the Spark Cassandra Connector will push this query down to Cassandra and will read only the data where ts is in the given time range - you can check this by executing filtered.explain and verifying that both time filters are marked with the * symbol.
Another way to detect changes is to retrieve the write time from Cassandra and filter the changes based on that information. Fetching the writetime is supported in the RDD API for all recent versions of the SCC, and in the Dataframe API since the release of SCC 2.5.0 (it requires at least Spark 2.4, although it may work with 2.3 as well). After fetching this information, you can apply filters on the data and extract the changes; a sketch follows the list of caveats below. But you need to keep in mind several things:
there is no way to detect deletes using this method
write time information exists only for regular & static columns, not for primary key columns
each column may have its own write time value if there was a partial update of the row after insertion
in most versions of Cassandra, calling the writetime function on a collection column (list/map/set) generates an error, and it may return null for columns with a user-defined type
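As a rough sketch of the writetime-based approach, assuming SCC 2.5.0+ (where the writeTime function is available from org.apache.spark.sql.cassandra) and the example table above; the cut-off value is arbitrary:
import org.apache.spark.sql.cassandra._
import spark.implicits._

// The write time is fetched from Cassandra, but the filter on it is applied in Spark,
// so this is still a full table scan.
val data = spark.read.cassandraFormat("tbl", "test").load()
val changed = data
  .select($"pk", $"ts", $"v1", writeTime("v1").as("v1_wt"))
  .filter($"v1_wt" > 1552228894373000L)  // write time is in microseconds since the epoch
changed.show()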
P.S. Even if you had CDC enabled, it's not a trivial task to use it correctly:
you need to de-duplicate changes - you have RF (replication factor) copies of each change
some changes could be lost, for example when a node was down, and then propagated later via hints or repairs
TTL isn't easy to handle
...
For CDC you may look for presentations from the 2019 DataStax Accelerate conference - there were several talks on that topic.

Using cassandra's ttl() in where clause

I'd like to ask if it's possible to get rows from Cassandra that have a TTL (time to live) greater than 0, so that in the next step I can update those rows with TTL 0. The goal is basically to change the TTL of all the columns for every entry in the DB to 0.
I've tried SELECT * FROM table WHERE ttl(column1) > 0, but it seems it's not possible to use the ttl() function in a WHERE clause.
I also found a way where we can export all the rows to CSV, delete the data in our table and import it again from CSV with a new TTL. That works, but it's risky because we have over a million entries in production and we don't know how it will behave.
You can't do this with CQL alone - you need support from some tool, for example:
DSBulk - you can unload all your data into a CSV file and load it back with a new TTL set (if you set it to 0, then just load the data back). There is a blog post that shows how to use DSBulk with TTL. But you can't put a condition on the TTL, which is why you need to unload all your data.
Spark with the Spark Cassandra Connector (even in local master mode). Version 2.5.0 supports TTL in the Dataframe API (earlier versions supported it only in the RDD API) - for Spark 2.4 you need to correctly register the functions. This can be done one time, directly in the spark-shell, with something like this (adjust the columns in the select & filter statements):
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("table", "keyspace").load
val ttlData = data.select(ttl("col1").as("col_ttl"), $"col2", $"col3").filter($"col_ttl" > 0)
ttlData.drop("col_ttl").write.cassandraFormat("table", "keyspace").mode("append").save

Cassandra query table without partition key

I am trying to extract data from a table as part of a migration job.
The schema is as follows:
CREATE TABLE IF NOT EXISTS ${keyspace}.entries (
  username text,
  entry_type int,
  entry_id text,
  PRIMARY KEY ((username, entry_type), entry_id)
);
In order to query the table we need the partition keys, the first part of the primary key.
Hence, if we know the username and the entry_type, we can query the table.
In this case the username can be whatever, but the entry_type is an integer in the range 0-9.
When doing the extraction we iterate over the table 10 times for every username to make sure we try all values of entry_type.
We can no longer find any entries, as we have depleted our list of usernames. But nodetool tablestats reports that there is still data left in the table, gigabytes even. Hence we assume the table is not empty.
But I cannot find a way to inspect the table to figure out which usernames remain in it. If I could inspect it, I could add the usernames left in the table to our extraction job and eventually we could deplete the table. But I cannot simply query the table like this:
SELECT * FROM ${keyspace}.entries LIMIT 1
as Cassandra requires the partition keys to make meaningful queries.
What can I do to figure out what is left in our table?
As per the comment, the migration process includes a DELETE operation on the Cassandra table, but the engine will delay actually removing the affected records from disk; this process is controlled internally with tombstones and the gc_grace_seconds attribute of the table. The reason for this delay is fully explained in this blog entry; tl;dr, if the default value is still in place, Cassandra will need at least 10 days (864,000 seconds) to pass from the execution of the delete before the actual removal of the data.
For your case, one way to proceed is:
Ensure that all your nodes are "Up" and "Healthy" (UN)
Decrease the gc_grace_seconds attribute of your table; the example below sets it to 1 minute, while the default is 10 days (864,000 seconds):
ALTER TABLE ${keyspace}.entries WITH gc_grace_seconds = 60;
Manually compact the table:
nodetool compact ${keyspace} entries
Once the process is complete, nodetool tablestats should be up to date.
To answer your first question, I would like to shed more light on the gc_grace_seconds property.
In Cassandra, data isn't deleted the same way it is in an RDBMS. Cassandra is designed for high write throughput and avoids reads-before-writes. So in Cassandra, a delete is actually an update, and updates are actually inserts. A "tombstone" marker is written to indicate that the data is now (logically) deleted (also known as a soft delete). Records marked with tombstones must be removed to reclaim storage space, which is done by a process called compaction. But remember that tombstones are eligible for physical deletion / garbage collection only after a specific number of seconds known as gc_grace_seconds. This is a very good blog to read more in detail: https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
So it's possible you are looking at the table size before gc_grace_seconds has elapsed, and the data is still there.
Coming to your second issue, where you want to fetch some sample rows from the table without providing partition keys: you can analyze your table content using Spark. The Spark Cassandra Connector allows you to write applications that use Spark to analyze database data. You can follow the articles / documentation below to write a quick, handy Spark application to analyze the Cassandra data; a minimal sketch follows the links.
https://www.instaclustr.com/support/documentation/cassandra-add-ons/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkJavaApi.html
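For example, a minimal sketch, run in the spark-shell with the connector on the classpath (the keyspace name is a placeholder):
import org.apache.spark.sql.cassandra._

// Full scan of the table through the connector, listing the partition keys that still hold data.
val entries = spark.read.cassandraFormat("entries", "my_keyspace").load()
entries.select("username", "entry_type").distinct().show(100, truncate = false)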
I would recommend not deleting records while you do the migration. Rather, first complete the migration, and after that do a quick validation / verification to ensure all records were migrated successfully (this you can easily do using Spark by comparing DataFrames from the old and new tables, as sketched below). After successful verification, truncate the old table, as truncate does not create tombstones and hence is more efficient. Note that a huge number of tombstones is not good for cluster health.
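For the validation step, a minimal sketch, assuming the old and new tables have the same schema (the keyspace names are placeholders):
import org.apache.spark.sql.cassandra._

// Rows present in the old table but missing from the new one after the migration.
val oldTable = spark.read.cassandraFormat("entries", "old_keyspace").load()
val newTable = spark.read.cassandraFormat("entries", "new_keyspace").load()
val missing = oldTable.except(newTable)
println(s"Rows not migrated: ${missing.count()}")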

spark Dataframe execute UPDATE statement

Hi guys,
I need to perform JDBC operations using an Apache Spark DataFrame.
Basically I have a historical JDBC table called Measures where I have to do two operations:
1. Set endTime validity attribute of the old measure record to the current time
2. Insert a new measure record setting endTime to 9999-12-31
Can someone tell me how to perform (if we can) an update statement for the first operation and an insert for the second operation?
I tried to use this statement for the first operation:
val dfWriter = df.write.mode(SaveMode.Overwrite)
dfWriter.jdbc("jdbc:postgresql:postgres", tableName, prop)
But it doesn't work because there is a duplicate key violation. If we can do an update, how can we do a delete statement?
Thanks in advance.
I don't think it's supported out of the box yet by Spark. What you can do is iterate over the DataFrame/RDD using foreachPartition() and manually update/delete the table using the JDBC API; a sketch follows the link below.
Here is a link to a similar question:
Spark Dataframes UPSERT to Postgres Table
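For illustration, a minimal sketch of a manual upsert against Postgres, assuming a hypothetical table measures(measure_id bigint primary key, value double precision, end_time timestamp); adjust the column names, types, URL and credentials to your schema:
import java.sql.{DriverManager, Timestamp}

df.rdd.foreachPartition { rows =>
  // One connection and one prepared statement per partition; rows are sent as a batch.
  val conn = DriverManager.getConnection("jdbc:postgresql:postgres", "user", "password")
  val stmt = conn.prepareStatement(
    """INSERT INTO measures (measure_id, value, end_time)
      |VALUES (?, ?, ?)
      |ON CONFLICT (measure_id)
      |DO UPDATE SET value = EXCLUDED.value, end_time = EXCLUDED.end_time""".stripMargin)
  try {
    rows.foreach { row =>
      stmt.setLong(1, row.getAs[Long]("measure_id"))
      stmt.setDouble(2, row.getAs[Double]("value"))
      stmt.setTimestamp(3, row.getAs[Timestamp]("end_time"))
      stmt.addBatch()
    }
    stmt.executeBatch()
  } finally {
    stmt.close()
    conn.close()
  }
}
A DELETE can be issued the same way from inside foreachPartition with a prepared DELETE statement.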

Spark Cassandra connector - where clause

I am trying to do some analytics on time series data stored in Cassandra, using Spark and the new connector published by DataStax.
In my schema the partition key is the meter ID, and I want to run Spark operations only on specific series, therefore I need to filter by meter ID.
I would then like to run a query like: Select * from timeseries where series_id = X
I have tried to achieve this by doing:
JavaRDD<CassandraRow> rdd = sc.cassandraTable("test", "timeseries").select(columns).where("series_id = ?",ids).toJavaRDD();
When executing this code the resulting query is:
SELECT "series_id", "timestamp", "value" FROM "timeseries" WHERE token("series_id") > 1059678427073559546 AND token("series_id") <= 1337476147328479245 AND series_id = ? ALLOW FILTERING
A clause is automatically added on my partition key (token("series_id") > X AND token("series_id") <= Y), and then mine is appended after that. This obviously does not work, and I get an error saying: "series_id cannot be restricted by more than one relation if it includes an Equal".
Is there a way to get rid of the clause added automatically? Am I missing something?
Thanks in advance
The driver automatically determines the partition key using table metadata it fetches from the cluster itself. It then uses this to append the token ranges to your CQL so that it can read a chunk of data from the specific node it's trying to query. In other words, Cassandra thinks series_id is your partition key and not meter_id. If you run a describe command on your table, I bet you'll be surprised.
