Truncate silver delta live table and reload - databricks

I have a parameter value which determines whether the table needs to be full load or an incremental load. In delta live tables, incremental load is not an issue as we apply changes and specify whether the table needs to be SCD1 or SCD2.
However, there are scenarios where the table is not an incremental table. Every time the data comes in, the silver table needs to be truncated and reloaded again completely. I tried setting the value apply_as_truncates =True, but this doesn't work.
dlt.apply_changes(
    source=bronze_tablename,
    target=silver_tablename,
    keys=key_fields,
    sequence_by=sequence_by,
    apply_as_truncates=True,
    stored_as_scd_type=1
)
Could you please let us know how a DLT in silver can be truncated and reloaded?
Thanks in advance.
Regards,
R

how to avoid errors when querying hive table being loaded by Spark at the same time

We have a use case where we run an ETL written in Spark on top of some streaming data. The ETL writes results to the target Hive table every hour, but users commonly query the target table, and we have seen query errors caused by Spark loading the table at the same time. What alternatives do we have to avoid or minimize these errors? Is there a property we can set on the Spark job (or on the Hive table), or should we do something like creating a temporary table?
The error is:
java.io.FileNotFoundException: File does not exist [HDFS PATH]
I think this happens because the metadata says there is a file A, and that file gets deleted during the job execution.
The table is partitioned by year, month, day (using HDFS as storage), and every time the ETL runs it updates only the current date's partition (via a partition overwrite). Currently no "transactional" tables are enabled in the cluster (and when I tested the use case with them on a test cluster, I had no luck).
The easy option is to use a table format that's designed to handle concurrent reads and writes, like Hudi or Delta Lake. The more complicated option is a partitioned, append-only table that the writer writes to; on completion, the writer updates a view to point to the new data. Another possible option is to partition the table on insert time.
Have a set of two tables and a view over them:
CREATE TABLE foo_a (...);
CREATE TABLE foo_b (...);
CREATE VIEW foo AS SELECT x, y, z, ... FROM foo_a;
The first iteration of the ETL process needs to:
1. Synchronize foo_a -> foo_b
2. Do the work on foo_b
3. Drop view foo and recreate it pointing to foo_b
Until step 3 user queries run against table foo_a. From the moment of switch they run against foo_b. Next iteration of ETL will work in the opposite way.
This is not perfect. You need double storage and some extra complexity in the ETL. And anyway this approach might fail if:
a user is unlucky enough to hit the short window between dropping and recreating the view
a user submits a query that's heavy enough to run across two iterations of the ETL
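A minimal sketch of the two-table swap described above, in Python. It uses the standard-library sqlite3 module purely as a stand-in for Hive so the example runs anywhere; the columns x and y and the etl_iteration helper are hypothetical, and on a real cluster the same statements would go through whatever Hive client the ETL already uses.

```python
import sqlite3

COLUMNS = ["x", "y"]  # hypothetical column list for the illustration


def etl_iteration(conn, active, standby, new_rows):
    """Load `standby`, then repoint view `foo` at it. Returns the new active table."""
    cols = ", ".join(COLUMNS)
    # 1. Synchronize active -> standby
    conn.execute(f"DELETE FROM {standby}")
    conn.execute(f"INSERT INTO {standby} SELECT {cols} FROM {active}")
    # 2. Do the work on the standby table (here: append the new batch)
    conn.executemany(f"INSERT INTO {standby} ({cols}) VALUES (?, ?)", new_rows)
    # 3. Drop the view and recreate it pointing at the standby table
    conn.execute("DROP VIEW IF EXISTS foo")
    conn.execute(f"CREATE VIEW foo AS SELECT {cols} FROM {standby}")
    conn.commit()
    return standby


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo_a (x INTEGER, y TEXT)")
conn.execute("CREATE TABLE foo_b (x INTEGER, y TEXT)")
conn.execute("CREATE VIEW foo AS SELECT x, y FROM foo_a")

# Readers always query `foo`; the ETL alternates between the two tables.
active = etl_iteration(conn, "foo_a", "foo_b", [(1, "first")])
active = etl_iteration(conn, "foo_b", "foo_a", [(2, "second")])
rows = conn.execute("SELECT x, y FROM foo ORDER BY x").fetchall()
print(active, rows)
```

The helper returns the name of the table the view now points to, so the caller can pass it as `active` on the next run and the roles alternate.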

Cassandra query table without partition key

I am trying to extract data from a table as part of a migration job.
The schema is as follows:
CREATE TABLE IF NOT EXISTS ${keyspace}.entries (
username text,
entry_type int,
entry_id text,
PRIMARY KEY ((username, entry_type), entry_id)
);
In order to query the table we need the partition keys, the first part of the primary key.
Hence, if we know the username and the entry_type, we can query the table.
In this case the username can be whatever, but the entry_type is an integer in the range 0-9.
When doing the extraction, we iterate over the table 10 times for every username to make sure we try every value of entry_type.
We can no longer find any entries as we have depleted our list of usernames. But our nodetool tablestats report that there is still data left in the table, gigabytes even. Hence we assume the table is not empty.
But I cannot find a way to inspect the table to figure out which usernames remain in it. If I could inspect it, I could add the remaining usernames to our extraction job and eventually deplete the table. But I cannot simply query the table like this:
SELECT * FROM ${keyspace}.entries LIMIT 1
as Cassandra requires the partition keys to make meaningful queries.
What can I do to figure out what is left in our table?
As per the comment, the migration process includes a DELETE operation on the Cassandra table, but the engine delays physically removing the affected records from disk; this is controlled internally by tombstones and the table's gc_grace_seconds attribute. The reason for this delay is fully explained in this blog entry. TL;DR: if the default value is still in place, at least 10 days (864,000 seconds) must pass after the delete executes before Cassandra actually removes the data.
For your case, one way to proceed is:
1. Ensure that all your nodes are "Up" and "Normal" (UN)
2. Decrease the gc_grace_seconds attribute of your table. This example sets it to 1 minute, while the default is 864000 (10 days):
ALTER TABLE <keyspace>.entries WITH gc_grace_seconds = 60;
3. Manually compact the table:
nodetool compact <keyspace> entries
4. Once the process has completed, nodetool tablestats should be up to date
To answer your first question, I would like to shed more light on the gc_grace_seconds property.
In Cassandra, data isn't deleted in the same way it is in RDBMSs. Cassandra is designed for high write throughput and avoids reads-before-writes. So in Cassandra, a delete is actually an update, and updates are actually inserts. A "tombstone" marker is written to indicate that the data is now (logically) deleted (also known as a soft delete). Tombstoned records must be removed to reclaim the storage space, which is done by a process called compaction. But remember that tombstones are eligible for physical deletion / garbage collection only after a specific number of seconds, known as gc_grace_seconds. This is a very good blog to read more in detail: https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
So you are possibly looking at the table size before gc_grace_seconds has elapsed, while the deleted data is still on disk.
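To make the timing concrete, here is a small Python sketch of the arithmetic. It only computes the earliest moment a tombstone becomes eligible for removal; the delete timestamp and the helper name are hypothetical, and actual removal still depends on a compaction running after that moment.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_GC_GRACE_SECONDS = 864_000  # 10 days, the default mentioned above


def earliest_purge_time(deleted_at, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Earliest time a tombstone written at `deleted_at` can be garbage collected."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


# Hypothetical delete timestamp, for illustration only.
deleted_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

default_purge = earliest_purge_time(deleted_at)      # 10 days later
lowered_purge = earliest_purge_time(deleted_at, 60)  # with gc_grace_seconds = 60
print(default_purge.isoformat())
print(lowered_purge.isoformat())
```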
Coming to your second issue, where you want to fetch some samples from the table without providing partition keys: you can analyze your table content using Spark. The Spark Cassandra Connector allows you to create Java applications that use Spark to analyze database data. You can follow these articles / documentation to write a quick Spark application that analyzes the Cassandra data.
https://www.instaclustr.com/support/documentation/cassandra-add-ons/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/
https://docs.datastax.com/en/dse/6.0/dse-dev/datastax_enterprise/spark/sparkJavaApi.html
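If setting up Spark is too heavy, a different technique (not the one from the links above) is to scan the table by token range, which is roughly what the connector does under the hood. The Python sketch below only builds the CQL statements. It assumes the cluster uses Murmur3Partitioner, and the keyspace name my_keyspace and both helper functions are hypothetical.

```python
# Token bounds for Murmur3Partitioner (assumption: the cluster uses this partitioner).
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def token_ranges(splits):
    """Split the token ring into `splits` contiguous (lo, hi) ranges."""
    width = (MAX_TOKEN - MIN_TOKEN) // splits
    bounds = [MIN_TOKEN + i * width for i in range(splits)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(table, lo, hi, first=False):
    """CQL that reads the partition key columns for one token range."""
    lower_op = ">=" if first else ">"  # include the very first bound only once
    return (
        f"SELECT username, entry_type FROM {table} "
        f"WHERE token(username, entry_type) {lower_op} {lo} "
        f"AND token(username, entry_type) <= {hi}"
    )


# "my_keyspace" is a placeholder; use more splits for a large table.
queries = [
    range_query("my_keyspace.entries", lo, hi, first=(i == 0))
    for i, (lo, hi) in enumerate(token_ranges(4))
]
print(len(queries))

# Each statement can then be run with any driver, for example:
#   for row in session.execute(query):
#       remaining.add((row.username, row.entry_type))
# and the (username, entry_type) pairs deduplicated on the client.
```

Each range is small enough to page through on its own, and the ranges can be processed one at a time, so the scan does not need to hold the whole table in memory.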
I would recommend not deleting records while you do the migration. Instead, complete the migration first, and then do a quick validation / verification to ensure all records were migrated successfully (you can easily do this with Spark by comparing dataframes from the old and new tables). After successful verification, truncate the old table; truncate does not create tombstones and is therefore more efficient. Note that a huge number of tombstones is not good for cluster health.

Load data from one table to another every 10 mins - Cassandra

We have a stream of data coming to Table A every 10 mins. No history preserved. The existing data has to be flushed to a new table B every time data is loaded in Table A. Can this be done dynamically or automated in Cassandra?
I can think of exporting Table A to a CSV file and then loading it into Table B every time Table A is flushed. But I would like to have this done at the database level itself.
Any ideas or suggestions appreciated.
Thanks,
Arun
For smaller amounts of data you could put this into cron:
https://dba.stackexchange.com/questions/58901/what-is-a-good-way-to-copy-data-from-one-cassandra-columnfamily-to-another-on-th
For larger amounts of data, and if you are running a newer version of Cassandra (3.8+), look at change data capture (CDC):
http://cassandra.apache.org/doc/latest/operating/cdc.html
https://issues.apache.org/jira/browse/CASSANDRA-8844
and then replay the data into the table you need (with some outside process, script, app, etc.).
There are already some tools around, such as:
https://github.com/carloscm/cassandra-commitlog-extract
You could use the samples there to cover your use-case.
But for most use cases this is handled at the application level; writes are relatively cheap with Cassandra.
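As a rough illustration of the application-level approach, the Python sketch below copies every row from one table to another and then empties the source. It is only a sketch under assumptions: `session` is expected to behave like a DataStax Python driver session (an `execute(query, parameters)` method that returns iterable rows), the table and column names are placeholders, and the choice to truncate the source after copying is one reading of the requirement.

```python
def flush_table(session, source, target, columns):
    """Copy every row from `source` into `target`, then empty `source`.

    `session` is any object with an `execute(query, parameters=None)` method
    that returns an iterable of rows for SELECT statements.
    Returns the number of rows copied.
    """
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    insert_cql = f"INSERT INTO {target} ({cols}) VALUES ({placeholders})"

    copied = 0
    for row in session.execute(f"SELECT {cols} FROM {source}"):
        session.execute(insert_cql, tuple(row))
        copied += 1

    # Empty the source only after every row has been written to the target.
    session.execute(f"TRUNCATE {source}")
    return copied


# Hypothetical usage, scheduled every 10 minutes (for example from cron):
#   flush_table(session, "my_keyspace.table_a", "my_keyspace.table_b", ["id", "value"])
```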

MemSQL for Last n Days Data

I plan to use MemSQL to store my last 7 days of data for real-time analytics using SQL.
I checked the documentation and found that there is no TTL / expiration feature in MemSQL.
Is there any such feature (in case I missed it)?
Does MemSQL fit the use case if I run a daily delete of data older than 7 days? I am quite curious about fragmentation.
We tried this on PostgreSQL, where we need to execute the VACUUM command, and it takes a long time to run.
There is no TTL/expiration feature. You can do it by running delete queries. Many customer use cases are doing this type of thing, so yes MemSQL does fit the use case. Fragmentation generally shouldn't be too much of a problem here - what kind of fragmentation are you concerned about?
There is no out-of-the-box TTL feature in MemSQL.
We achieved TTL by adding an additional TS column with the TIMESTAMP(6) datatype to our MemSQL rowstore table.
This provides automatic insertion of the current timestamp when you add a new row to the table.
When querying data from this table, you can apply a simple filter on this TIMESTAMP column to exclude records older than your TTL value.
https://docs.memsql.com/sql-reference/v6.7/datatypes/#time-and-date
You can always have a batch job that runs once a month and deletes older data.
We have not seen any issues due to fragmentation, but you can do the following once in a while if fragmentation is a concern for you:
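The filter-plus-periodic-delete pattern can be sketched as follows in Python. The example uses the standard-library sqlite3 module purely as a stand-in so it runs anywhere; the events table, its columns, and the fixed "now" value are hypothetical, and against MemSQL the same two statements would be sent through a MySQL-compatible client.

```python
import sqlite3
from datetime import datetime, timedelta

TTL = timedelta(days=7)

conn = sqlite3.connect(":memory:")
# ts is stored as an ISO-8601 string, so string comparison matches time order.
conn.execute("CREATE TABLE events (id INTEGER, ts TEXT)")

now = datetime(2024, 1, 15)  # fixed "now" so the example is reproducible
samples = [(1, now - timedelta(days=10)), (2, now - timedelta(days=3)), (3, now)]
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(row_id, ts.isoformat()) for row_id, ts in samples],
)

cutoff = (now - TTL).isoformat()

# Query side: hide rows older than the TTL even if they are not deleted yet.
live = conn.execute(
    "SELECT id FROM events WHERE ts >= ? ORDER BY id", (cutoff,)
).fetchall()

# Batch side: physically delete the expired rows.
deleted = conn.execute("DELETE FROM events WHERE ts < ?", (cutoff,)).rowcount
conn.commit()
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(live, deleted, remaining)
```

Because readers always apply the cutoff filter, the delete job can run on any schedule without affecting query results.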
MemSQL’s memory allocators can become fragmented over time (especially if a large table is shrunk dramatically by deleting data randomly). There is no command currently available that will compact them, but running ALTER TABLE ADD INDEX followed by ALTER TABLE DROP INDEX will do it.
Warning
Caution should be taken with this workaround. Plans will rebuild and the two ALTER queries are going to move all rows in the table twice, so this should not be used that often.
Reference:
https://docs.memsql.com/troubleshooting/latest/troubleshooting/

Drop table or truncate table in Cassandra, which is better

We have a use case where we need to re-create a table every day with current data in Cassandra. Should we use DROP TABLE or TRUNCATE TABLE for this, and which would be more efficient? We do not need the data to be backed up.
Thanks
Ankur
I think for almost all cases truncate is a safer operation than drop and recreate. There have been several issues with dropping/recreating in the past: ghost data, schema disagreement, etc. Although there have been a number of fixes to make drop/recreate more stable, if it's an operation you are performing every day, truncate should be much cheaper and more stable.
Drop table drops the table and all data. Truncate clears all data in the table, and by default creates a snapshot of the data (but not the schema). Efficiency-wise, they're close, though truncate will create the snapshot. You can disable this by setting auto_snapshot to false in cassandra.yaml, but the setting is server-wide. If it's not too much trouble, I'd drop and recreate the table, but I've seen issues if you don't wait a while after the drop before recreating.
Source : https://support.datastax.com/hc/en-us/articles/204226339-FAQ-How-to-drop-and-recreate-a-table-in-Cassandra-versions-older-than-2-1
NOTE: By default, snapshots are created when tables are dropped or truncated. These will need to be cleaned out manually to reclaim disk space.
Tested manually as well.
Truncate will keep the schema though, drop will not.
Beware!
From datastax documentation: https://docs.datastax.com/en/archived/cql/3.3/cql/cql_reference/cqlTruncate.html
Note: TRUNCATE sends a JMX command to all nodes, telling them to delete SSTables that hold the data from the specified table. If any of these nodes is down or doesn't respond, the command fails and outputs a message like the following:
truncate cycling.user_activity;
Unable to complete request: one or more nodes were unavailable.
Unfortunately, there is nothing in the documentation saying whether DROP behaves differently.
