How to archive or delete Cassandra data after receiving an event in WSO2 BAM - cassandra

I use WSO2 BAM version 2.3.0, where I defined a stream that holds a large amount of data in the Cassandra datasource. Currently my Hive script processes all events from the keyspace, although 99% of the data is unnecessary, and it takes up disk space too.
My idea is to clear this data after it becomes unnecessary.
The format of the stream is:
{"streamId":"kroki_i_kolejki_zlecen:1.0.0","name":"kroki_i_kolejki_zlecen","version":"1.0.0","nickName":"Kroki i kolejki zlecen","description":"Wyniki i daty zamkniecia zlecen","payloadData":[{"name":"casenum","type":"STRING"},{"name":"type_id","type":"STRING"},{"name":"id_zlecenie","type":"STRING"},{"name":"sid","type":"STRING"},{"name":"step_name","type":"STRING"},{"name":"proc_name","type":"STRING"},{"name":"step_desc","type":"STRING"},{"name":"audit_date","type":"STRING"},{"name":"audit_usecs","type":"STRING"},{"name":"user_name","type":"STRING"}]}
My intention is to delete all rows with the same payload_id_zlecenie column value after I receive an event with a specific payload_type_id.
In a relational database this would be equivalent to the query:
delete from kroki_i_kolejki_zlecen where payload_id_zlecenie = [argument];
Is it possible to do this?

To my knowledge, you cannot delete Cassandra data from Hive. Link [1] given by Inosh describes how to archive Cassandra records older than a specific time period (e.g. records older than 3 months). All the archived data is stored in a column family with the suffix "_arch". That feature uses a custom analyzer inside the generated Hive script to delete the original Cassandra rows. Also note that it takes about 10 days for deleted rows, together with their row keys, to be removed completely; until that happens you will see some empty fields associated with the Cassandra row ID.
Inosh's [2] is the real solution to your problem. Once incremental processing is enabled, the Hive script processes only the Cassandra rows that were not processed in the previous execution. Hive aggregates the values processed in each run and keeps them for the future; on the next run it takes the last processed timestamp, processes only the records that arrived after that timestamp, and combines the new aggregated value with the older one to get the overall value.
[1] - http://docs.wso2.org/display/BAM240/Archive+Cassandra+Data
[2] - http://docs.wso2.org/pages/viewpage.action?pageId=32345660

You can use the Cassandra data archival feature [1] to archive Cassandra data.
Also refer to Incremental Analysis [2], a new feature released with BAM 2.4.0. With it, received data can be analyzed incrementally, without processing all the events in the column families.
[1] - http://docs.wso2.org/display/BAM240/Archive+Cassandra+Data
[2] - http://docs.wso2.org/pages/viewpage.action?pageId=32345660

Related

Cassandra 3.7 CDC / incremental data load

I'm very new to the ETL world and I wish to implement incremental data loading with Cassandra 3.7 and Spark. I'm aware that later versions of Cassandra do support CDC, but I can only use Cassandra 3.7. Is there a method through which I can track only the changed records and use Spark to load them, thereby performing incremental data loading?
If it can't be done on the Cassandra end, any other suggestions are also welcome on the Spark side :)
It's quite a broad topic, and an efficient solution will depend on the amount of data in your tables, the table structure, how data is inserted/updated, etc. The specific solution may also depend on the version of Spark available. One downside of a Spark-only method is that you can't easily detect deletes without keeping a complete copy of the previous state, so you can generate a diff between the two states.
In all cases you'll need to perform a full table scan to find changed entries, but if your table is organized specifically for this task, you can avoid reading all the data. For example, if you have a table with the following structure:
create table test.tbl (
pk int,
ts timestamp,
v1 ...,
v2 ...,
primary key(pk, ts));
then if you run the following query:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("tbl", "test").load()
val filtered = data.filter("""ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")
then the Spark Cassandra Connector will push this query down to Cassandra and read only the data where ts is in the given time range - you can check this by executing filtered.explain and verifying that both time filters are marked with the * symbol.
Another way to detect changes is to retrieve the write time from Cassandra and filter based on that information. Fetching the writetime is supported in the RDD API for all recent versions of the SCC, and in the DataFrame API since SCC 2.5.0 (which requires at least Spark 2.4, although it may work with 2.3 as well). After fetching this information you can apply filters on the data and extract the changes (a sketch follows the list below). But you need to keep several things in mind:
there is no way to detect deletes using this method
write time information exists only for regular and static columns, not for primary key columns
each column may have its own write time value if the row was partially updated after insertion
in most versions of Cassandra, calling the writetime function on a collection column (list/map/set) generates an error, and it may return null for a column with a user-defined type
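For illustration, here is a minimal sketch of the writetime-based approach with the RDD API, under these assumptions: the keyspace test, table tbl and column v1 reuse the example above, sc is the SparkContext (spark.sparkContext), v1 is never null, and cutoffMicros stands for whatever "last processed" watermark you persist between runs (writetime values are microseconds since the epoch):
import com.datastax.spark.connector._

val sc = spark.sparkContext
val cutoffMicros = 1552230094373000L  // illustrative watermark saved from the previous run

// fetch the write time of v1 alongside the data, then keep only rows written after the watermark
val changed = sc.cassandraTable("test", "tbl")
  .select("pk", "ts", "v1", "v1".writeTime as "v1_writetime")
  .filter(row => row.getLong("v1_writetime") > cutoffMicros)
After extracting the changes you would persist the maximum observed writetime as the watermark for the next run.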
P.S. Even if you had CDC enabled, it's not a trivial task to use it correctly:
you need to de-duplicate changes - you have RF copies of the changes
some changes could be lost, for example when a node was down and the changes were propagated later via hints or repairs
TTL isn't easy to handle
...
For CDC you may look for presentations from the 2019 DataStax Accelerate conference - there were several talks on that topic.

Cassandra query table without partition key

I am trying to extract data from a table as part of a migration job.
The schema is as follows:
CREATE TABLE IF NOT EXISTS ${keyspace}.entries (
username text,
entry_type int,
entry_id text,
PRIMARY KEY ((username, entry_type), entry_id)
);
In order to query the table we need the partition keys, the first part of the primary key.
Hence, if we know the username and the entry_type, we can query the table.
In this case the username can be whatever, but the entry_type is an integer in the range 0-9.
When doing the extraction we iterate over the table 10 times for every username to make sure we try all versions of entry_type.
We can no longer find any entries, as we have exhausted our list of usernames, but nodetool tablestats reports that there is still data left in the table, gigabytes even. Hence we assume the table is not empty.
But I cannot find a way to inspect the table to figure out which usernames remain in it. If I could inspect it, I could add the remaining usernames to our extraction job and eventually we could empty the table. But I cannot simply query the table like this:
SELECT * FROM ${keyspace}.entries LIMIT 1
as Cassandra requires the partition keys to make meaningful queries.
What can I do to figure out what is left in our table?
As per the comment, the migration process includes a DELETE operation on the Cassandra table, but the engine delays actually removing the affected records from disk; this process is controlled internally with tombstones and the gc_grace_seconds attribute of the table. The reason for this delay is fully explained in this blog entry; the TL;DR is that, if the default value is still in place, at least 10 days (864,000 seconds) must pass after the delete before the data is actually removed.
For your case, one way to proceed is:
Ensure that all your nodes are "Up" and "Healthy" (UN)
Decrease the gc_grace_seconds attribute of your table; the following example sets it to 1 minute, while the default is 864,000 seconds (10 days):
ALTER TABLE ${keyspace}.entries WITH gc_grace_seconds = 60;
Manually compact the table:
nodetool compact ${keyspace} entries
Once the process is completed, nodetool tablestats should be up to date.
To answer your first question, I would like to shed more light on the gc_grace_seconds property.
In Cassandra, data isn't deleted the same way it is in an RDBMS. Cassandra is designed for high write throughput and avoids reads-before-writes, so a delete is actually an update, and updates are actually inserts. A "tombstone" marker is written to indicate that the data is now (logically) deleted (also known as a soft delete). Tombstoned records must be removed to reclaim the storage space, which is done by a process called compaction. But remember that tombstones are eligible for physical deletion / garbage collection only after a specific number of seconds known as gc_grace_seconds. This is a very good blog post to read in more detail: https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
You are possibly looking at the table size before gc_grace_seconds has elapsed, so the deleted data is still there.
Coming to your second issue, where you want to fetch some samples from the table without providing partition keys: you can analyze your table content using Spark. The Spark Cassandra Connector allows you to create Java applications that use Spark to analyze the database data. You can follow these articles / documentation to write a quick, handy Spark application to analyze the Cassandra data (a minimal sketch follows the links below).
https://www.instaclustr.com/support/documentation/cassandra-add-ons/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkJavaApi.html
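As a minimal sketch (assuming the keyspace is called ks and Spark can reach the cluster via spark.cassandra.connection.host), reading the whole table through the connector does not require the partition key, so you can list the usernames that are still present:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._

val spark = SparkSession.builder()
  .appName("inspect-entries")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

// full scan of the table; no partition key needed
val entries = spark.read.cassandraFormat("entries", "ks").load()

// distinct usernames still present, which can be fed back into the extraction job
entries.select("username").distinct().show(100, truncate = false)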
I would recommend not deleting records while you do the migration. Rather, first complete the migration and then do a quick validation / verification to ensure all records were migrated successfully (this you can easily do with Spark by comparing dataframes from the old and new tables). After successful verification, truncate the old table, as truncate does not create tombstones and is hence more efficient. Note that a huge number of tombstones is not good for cluster health.

How Cassandra handles duplicated data when reading from SSTables

In DataStax's documentation, it says:
During a write, Cassandra adds each new row to the database without checking on whether a duplicate record exists. This policy makes it possible that many versions of the same row may exist in the database.
As far as I understand, that means there can be more than one non-compacted SSTable containing different versions of the same row. How does Cassandra handle the duplicated data when it reads from these SSTables?
@quangh: As already stated in the documentation:
This is why Cassandra performs another round of comparisons during a read process. When a client requests data with a particular primary key, Cassandra retrieves many versions of the row from one or more replicas. The version with the most recent timestamp is the only one returned to the client ("last-write-wins").
All write operations have an associated timestamp. In this case different nodes will have different versions of the same row, but during the read operation Cassandra will pick the version with the latest timestamp. I hope this solves your query.
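To make last-write-wins concrete, here is a minimal sketch using the DataStax Java driver from Scala (the keyspace/table ks.tbl and columns id, value are hypothetical, and a node is assumed to be running on localhost); the write with the higher timestamp wins even though it arrives first:
import com.datastax.oss.driver.api.core.CqlSession

val session = CqlSession.builder().build()

// two writes to the same row with explicit timestamps, applied in "wrong" arrival order
session.execute("INSERT INTO ks.tbl (id, value) VALUES (1, 'new') USING TIMESTAMP 2000")
session.execute("INSERT INTO ks.tbl (id, value) VALUES (1, 'old') USING TIMESTAMP 1000")

// the read reconciles the versions and returns the value with the latest timestamp
val winner = session.execute("SELECT value FROM ks.tbl WHERE id = 1").one().getString("value")
// winner == "new"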

How to update or even reset rows in persistent table given multiple simultaneous readers?

I have an exchangeRates table that gets updated in batch once per week. This is to be used by other batch and streaming jobs, across different clusters - thus I want to save this as a persistent, shared table for all the jobs to share.
allExchangeRatesDF.write.saveAsTable("exchangeRates")
How best, then, (for the batch job that manages this data) to gracefully update the table contents (actually overwrite them completely), considering the various Spark jobs that consume it, and particularly given its use in some 24/7 Structured Streaming queries?
I've checked the APIs; maybe I am missing something obvious! Very likely.
Thanks!
I think you expect some kind of transaction support from Spark, so that while a saveAsTable is in progress Spark would hold all writes until the update/reset has finished.
I think the best way to deal with the requirement is to append new records (using insertInto) with a batch id that denotes the rows belonging to a "new table".
insertInto(tableName: String): Unit Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.
You'd then use the batch id to deal with the rows as if they were the only rows in the dataset.
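A minimal sketch of that idea, reusing allExchangeRatesDF and the exchangeRates table from the question, and assuming the shared table was (re)created with an extra batch_id column as its last column (the column name is illustrative; note that insertInto matches columns by position):
import org.apache.spark.sql.functions.{col, lit, max}

val batchId = System.currentTimeMillis()  // identifies this week's load

allExchangeRatesDF
  .withColumn("batch_id", lit(batchId))
  .write
  .insertInto("exchangeRates")            // append-only; nothing is overwritten

// consumers pick the most recent batch and treat those rows as "the" table
val latestBatchId = spark.table("exchangeRates").agg(max("batch_id")).first().getLong(0)
val currentRates  = spark.table("exchangeRates").filter(col("batch_id") === latestBatchId)
Old batches can later be dropped by a housekeeping job once no streaming query depends on them.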

Can I detect conflicts when writing to Cassandra?

Is there some timestamp/counter that can be used to validate that in a read-modify-write cycle, the data in the row did not change between reading and modifying?
In other words, can I read some kind of ID while reading the row, and when I write it back tell Cassandra what that ID was, and the write then fails if the ID changed since then? (Which amounts to saying that some other write took place after I read the data)
Each column in Cassandra is a tuple (or a triplet) that contains a name, a value and a timestamp. The timestamp of the column represents the last time it was modified. If you have hundreds of nodes, whichever node has an update with the most recent timestamp wins. This is how eventual consistency is achieved.
zznate has a good presentation, Introduction to Apache Cassandra for Java Developers, where this topic is referenced (slide 37).
Accessing timestamp of a Cassandra column
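For reference, retrieving a column's timestamp comes down to CQL's WRITETIME function; here is a minimal sketch using the DataStax Java driver from Scala (the keyspace/table ks.tbl and columns id, value are hypothetical):
import com.datastax.oss.driver.api.core.CqlSession

val session = CqlSession.builder().build()

// WRITETIME returns the microsecond timestamp of the column's last modification
val row = session.execute(
  "SELECT value, WRITETIME(value) AS value_wt FROM ks.tbl WHERE id = ?",
  Integer.valueOf(1)
).one()

val value = row.getString("value")
val lastModifiedMicros = row.getLong("value_wt")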
In summary, you don't need "some kind of ID" when you can retrieve the timestamp for a given column, which represents the last time it was modified. However, at scale, with hundreds of nodes, how can you be sure that the node you are connecting to has the most up-to-date column? (Refer back to the zznate presentation.)
Point is, you can't, without enabling transactions:
Cassandra - transaction support
Cassandra Transaction with ZooKeeper - Does this work?
how to integrate cassandra with zookeeper to support transactions
And many more: cassandra & transactions
