Cassandra old data deletion

In Cassandra, we only need 100 days of data for specific tables. However, we only recently set the TTL value, and data older than that still remains in the system as stale data. We have been thinking of different approaches to delete the old data from the system. One suggestion was to create a Spark job to identify the data older than a specific timeframe and delete it all (sketched below).
Another thought was to create a new table with just 100 days of data and drop the old table. But I have various doubts about:
how to rename the table while live data is being updated,
how Cassandra will deal with such a table. If I recreate a new table with less data and rename it on one node (say node 1), will the other nodes in the cluster automatically delete the older data in their tables, or will they sync with the table on node 1 and push all the older data back onto it?
I am really new to Cassandra and require expert advice on this.
Please suggest if there are better ways to handle this.
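
A rough sketch of the Spark-delete idea mentioned above, assuming pyspark with the spark-cassandra-connector plus the Python cassandra-driver available on the executors; the keyspace, table, key and timestamp column names (ks.events, id, event_time) and the cutoff date are placeholders:

from cassandra.cluster import Cluster
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

# Select only the primary-key columns of rows older than the cutoff.
stale_keys = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="ks", table="events")
              .load()
              .filter("event_time < '2020-01-01'")
              .select("id", "event_time"))

def delete_partition(rows):
    # Placeholder connection; assumes event_time is a clustering column.
    cluster = Cluster(["cassandra-host"])
    session = cluster.connect("ks")
    stmt = session.prepare("DELETE FROM events WHERE id = ? AND event_time = ?")
    for row in rows:
        session.execute(stmt, (row["id"], row["event_time"]))
    cluster.shutdown()

# Issue the deletes from each Spark partition.
stale_keys.foreachPartition(delete_partition)

Note that row-by-row deletes at this scale generate a lot of tombstones, which is one reason the copy-to-a-new-table approach in the answer below is usually preferred.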

Cassandra does not have a way to rename a table, so you will need to:
create the new table with a different name
ensure this table has the default TTL set
load into it only the subset of records that you are interested in; this could be tricky, as the query will depend on the schema of the table (is the column with the timestamp part of the clustering key?); see the sketch after this list
update your application to point to the new table
drop the old table
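
A minimal sketch of the copy step in pyspark with the spark-cassandra-connector; the keyspace, table, and timestamp column names (ks.events, ks.events_new, event_time) are placeholders, and the new table is assumed to have been created beforehand with a table-level default TTL:

from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("copy-last-100-days")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

cutoff = datetime.utcnow() - timedelta(days=100)

# Read the old table and keep only the last 100 days.
recent = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="ks", table="events")
          .load()
          .filter(f"event_time >= '{cutoff:%Y-%m-%d}'"))

# The new table is assumed to already exist with the same schema and a
# table-level TTL, e.g.:
#   CREATE TABLE ks.events_new (... same schema ...)
#       WITH default_time_to_live = 8640000;  -- 100 days in seconds
(recent.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="ks", table="events_new")
 .mode("append")
 .save())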

Related

In Azure, is there a way to maintain a current external/polybase table while also maintaining history for rolling back transactions?

I have a Delta table in Azure Databricks that stores history for every change that occurs. I also have a PolyBase external table for the users to read from in Azure SQL Data Warehouse/Azure Synapse. But when an update or delete is necessary, I have to vacuum the Delta table in order to bring the PolyBase table up to the latest version; otherwise it keeps a copy from every previous version. That vacuum, by nature, deletes the history, so I can no longer roll back.
I'm thinking my only choice is to manually keep a row-change table with columns = [schema_name, table_name, primary_key, old_value, new_value] so that we can reapply the changes in reverse if necessary. Is there anything more elegant I can do?

How to delete all rows in Cassandra Keyspace

I need to delete all rows in Cassandra, but with Amazon Keyspaces it isn't possible to execute TRUNCATE tbl_name because the TRUNCATE API isn't supported yet.
The few ideas that come to mind are a little bit tricky:
Solution A
select all the rows
loop over all the rows and delete them, one by one or in a batch (see the sketch after the answers)
Solution B
DROP TABLE
CREATE TABLE with the structure of the old table
Do you have any ideas to keep the process as simple as possible?
Thanks in advance.
If the data is not required, go with Option B: drop the table and recreate it. You can pass in the capacity on the CREATE TABLE statement using custom table properties.
CREATE TABLE my_keyspace.my_table (
    id text,
    division text,
    project text,
    role text,
    manager_id text,
    PRIMARY KEY (id, division))
WITH CUSTOM_PROPERTIES = {
        'capacity_mode': {
            'throughput_mode': 'PROVISIONED',
            'read_capacity_units': 10,
            'write_capacity_units': 20},
        'point_in_time_recovery': {'status': 'enabled'}}
    AND TAGS = {'pii': 'true', 'prod': 'true'};
Option C: if you require the data, you can also leverage on-demand capacity mode, which is a pay-per-request mode. With no requests you only have to pay for storage. You can change modes once a day.
ALTER TABLE my_keyspace.my_table
WITH CUSTOM_PROPERTIES=
{'capacity_mode': {'throughput_mode': 'PAY_PER_REQUEST'}}
Solution B should be fine in the absence of TRUNCATE. In older versions of Cassandra (prior to 2.1), recreating a table with the same name was a problem; refer to the DataStax FAQ blog. But since then the issue has been resolved via CASSANDRA-5202.
If the data in the table is not required anymore, it is better to drop the table and recreate it. Moreover, deleting rows one by one (Solution A) will be a very tedious task if the table contains a big amount of data.
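
For completeness, Solution A can be scripted with the Python cassandra-driver, although, as noted above, it is tedious for large tables and generates tombstones; a rough sketch, with placeholder table and key names and the Amazon Keyspaces endpoint/TLS/auth setup omitted:

from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-host"])  # for Keyspaces, add the SSL/auth config
session = cluster.connect("my_keyspace")

delete_stmt = session.prepare(
    "DELETE FROM my_table WHERE id = ? AND division = ?")

# Page through the key columns only, then delete row by row.
for row in session.execute("SELECT id, division FROM my_table"):
    session.execute(delete_stmt, (row.id, row.division))

cluster.shutdown()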

Azure Data Factory DataFlow Filter is taking a lot of time

I have an ADF pipeline which executes a Data Flow.
The Data Flow has: a source table A which has around 1 million rows,
a Filter which has a query to select only yesterday's records from the source table,
Alter Row settings which use upsert,
and a Sink which is the archival table where the records get upserted.
This whole pipeline takes around 2 hours or so, which is not acceptable. Actually, only around 3,000 records are being transferred/upserted.
The core count is 16. I tried partitioning with round robin and 20 partitions.
A similar archival doesn't take more than 15 minutes for another table which has around 100K records.
I thought of creating a source which would select only yesterday's records, but for the dataset we can only select a table.
Please suggest if I am missing anything to optimize it.
The table of the Data Set really doesn't matter. Whichever activity you use to access that Data Set can be toggled to use a query instead of the whole table, so that you can pass in a value to select only yesterday's data from the database.
Of course, if you have the ability to create a stored procedure on the source, you could also do that.
When migrating really large sets of data, you'll get much better performance using a Copy activity to stage the data into an Azure Storage Blob before using another Copy activity to pull from that Blob into the destination. But for what you're describing here, that doesn't seem necessary.

Saving an existing item in table from dataFrame

I have a dataframe with a few rows, some of which already exist in the DB. I want to update a few columns of the existing rows. How can we do that?
I see we have SaveModes:
append and overwrite, which might serve the purpose, but there is a limitation in both cases.
With append, I am getting a primary key error, as this option tries to create a new row in the DB.
With overwrite, I will lose the values for the unchanged attributes in the tuple.
Can someone please suggest how I can update a few attributes (column values) of a row (tuple)?
This can be handled at the MySQL level; the concept is known as an upsert.
Case: the primary key is new.
The SQL will insert into the MySQL DB as a new row.
Case: the primary key already exists.
You can use
INSERT ... ON DUPLICATE KEY UPDATE
which will update the key with the new entries/changes.
Read More here and here.
The ideal way for such a use case is to insert your data into a temporary table in your MySQL DB first, and after that use a trigger to load that data into the original table. Call that trigger from Spark itself.
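
A sketch of that flow from pyspark, assuming a dataframe df already holds the changed rows; the table names, columns, and connection settings are placeholders, and instead of a trigger the upsert is issued directly from the job with INSERT ... ON DUPLICATE KEY UPDATE:

import pymysql

# 1. Write the dataframe to a staging table via JDBC (replaced on each run).
#    Requires the MySQL JDBC driver on the Spark classpath.
(df.write
 .format("jdbc")
 .option("url", "jdbc:mysql://mysql-host:3306/mydb")
 .option("dbtable", "users_staging")
 .option("user", "app")
 .option("password", "secret")
 .mode("overwrite")
 .save())

# 2. Upsert from the staging table into the target table.
conn = pymysql.connect(host="mysql-host", user="app",
                       password="secret", database="mydb")
with conn.cursor() as cur:
    cur.execute("""
        INSERT INTO users (id, name, email)
        SELECT id, name, email FROM users_staging
        ON DUPLICATE KEY UPDATE
            name = VALUES(name),
            email = VALUES(email)
    """)
conn.commit()
conn.close()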
In Spark, dataframes are immutable, so you cannot change a value in place. One way would be to read the complete table, make the modifications, and write the complete table back in overwrite mode. This will take time.
If your modifications are always for a particular group, say user-id based or date based, then you can write the data partitioned by that column using partitionBy(). You can then read that partition using .filter(), make the modifications, and overwrite only that partition using insertInto() (from pyspark 2.3.0); see the sketch below.
Refer to this answer for other pyspark versions: Overwrite specific partitions in spark dataframe write method
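
A sketch of that partition-overwrite pattern for pyspark 2.3+; the table name events_by_date (assumed to be partitioned by event_date) and the column names are placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Only rewrite the partitions present in the dataframe, not the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Read just one date partition, modify an existing column, and write it back.
one_day = (spark.table("events_by_date")
           .filter(F.col("event_date") == "2020-06-01")
           .withColumn("status", F.lit("archived")))

one_day.write.insertInto("events_by_date", overwrite=True)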

How to quickly migrate from one table to another with a different table structure in the same/different Cassandra cluster?

I have a table with more than 10,000,000 records in Cassandra, but for some reason I want to build another Cassandra table with the same fields plus several additional fields, and I will migrate the previous data into it. For now, the two tables are in the same Cassandra cluster.
I want to ask how to finish this task in the shortest time.
And if my new table is in a different Cassandra cluster, how do I do it?
Any advice will be appreciated!
If you just need to add blank fields to a table, then the best thing to do is use the alter table command to add the fields to the existing table. Then no copying of the data would be needed and the new fields would show up as null in the existing rows until you set them to something.
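If you go that route, a minimal sketch with the Python cassandra-driver; the keyspace, table, and new column names are placeholders:

from cassandra.cluster import Cluster

cluster = Cluster(["cassandra-host"])
session = cluster.connect()

# Add the new fields to the existing table; existing rows will show them as null.
session.execute("ALTER TABLE ks.events ADD category text")
session.execute("ALTER TABLE ks.events ADD score int")

cluster.shutdown()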
If you want to change the structure of the data in the new table, or write it to a different cluster, then you'd probably need to write an application to read each row of the old table, transform the data as needed, and then write each row to the new location.
You could also do this by exporting the data to a CSV file, writing a program to restructure the CSV file as needed, then importing the CSV file into the new location.
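A sketch of the restructuring step between the export and the import (the export/import themselves can be done with cqlsh's COPY TO / COPY FROM); the file paths, new column names, and default values are placeholders:

import csv

# Assumes the exported CSV has a header row.
with open("old_table.csv", newline="") as src, \
     open("new_table.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    new_fields = reader.fieldnames + ["category", "score"]
    writer = csv.DictWriter(dst, fieldnames=new_fields)
    writer.writeheader()
    for row in reader:
        # Fill the additional fields with defaults, or derive them from the
        # existing columns as needed.
        row["category"] = ""
        row["score"] = "0"
        writer.writerow(row)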
Another possible method would be to use Apache Spark. You'd read the existing table into an RDD, transform and filter the data into a new RDD, then save the transformed RDD to the new table. That would only work within the same cluster and would be fairly complex to set up.
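A sketch of that Spark approach using the spark-cassandra-connector's DataFrame API (rather than raw RDDs); the keyspace, table, and new column names are placeholders, and the new table is assumed to already exist:

from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("copy-to-new-table")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

old_df = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="ks", table="old_table")
          .load())

# Add the extra fields expected by the new table (nulls/defaults for now).
new_df = (old_df
          .withColumn("category", F.lit(None).cast("string"))
          .withColumn("score", F.lit(0)))

(new_df.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="ks", table="new_table")
 .mode("append")
 .save())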
