Cassandra point-in-time restore

I use Cassandra 3.11.4 and I want to restore a backup to a specific point in time. I changed the commitlog_archiving.properties file as follows:
archive_command=cp %path /cassandra/commitlog_archiving/%name
restore_command=/bin/cp -f %from %to
restore_directories=/cassandra/restore/
restore_point_in_time=2020:01:22 10:44:00
To restore, I load the most recent snapshot, copy the corresponding commitlogs (those produced after the snapshot) into restore_directories, and then restart Cassandra. Unfortunately it doesn't work: it seems to replay all the records, even those after the restore_point_in_time.
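For clarity, a minimal shell sketch of the restore procedure described above; the keyspace, table, and snapshot names are hypothetical, and the paths assume the configuration shown earlier:
# Hypothetical example - names and data paths are illustrative, not from the original question.
sudo service cassandra stop
# 1. Copy the snapshot's SSTables back into the table's data directory
cp /var/lib/cassandra/data/my_ks/my_table-<table-id>/snapshots/<snapshot-name>/* \
   /var/lib/cassandra/data/my_ks/my_table-<table-id>/
# 2. Stage the archived commitlog segments written after the snapshot into the
#    restore_directories location from commitlog_archiving.properties
cp /cassandra/commitlog_archiving/CommitLog-*.log /cassandra/restore/
# 3. Restart; on startup Cassandra replays the staged segments,
#    which should stop at restore_point_in_time
sudo service cassandra start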

Related

What to do to prevent Delta Lake checkpoints to be removed in Azure Databricks?

I noticed that I have only 2 checkpoints files in a delta lake folder. Every 10 commits, a new checkpoint is created and the oldest one is removed.
For instance, this morning I had 2 checkpoints: 340 and 350, and I was able to time travel from version 340 to 359.
Now, after a "write" action, I have 2 checkpoints: 350 and 360, and I can only time travel from version 350 to 360.
What can remove the old checkpoints? How can I prevent that?
I'm using Azure Databricks 7.3 LTS ML.
The ability to perform time travel isn't directly related to checkpoints. A checkpoint is just an optimization that allows quick access to the metadata as a Parquet file without needing to scan the individual transaction log files. This blog post describes the transaction log in more detail.
The commit history is retained for 30 days by default, and this can be customized as described in the documentation. Note that VACUUM may remove deleted data files that are still referenced in the commit history, because such files are retained for only 7 days by default, so it's better to check the corresponding settings.
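As an illustration of those settings (not from the original answer), a sketch that raises both retention intervals on a hypothetical table path, using the standard Delta table properties:
# Sketch only: '/tmp/dtest' is a placeholder path; pick intervals that fit your use case.
spark.sql("""
ALTER TABLE delta.`/tmp/dtest`
SET TBLPROPERTIES (
  delta.logRetentionDuration = 'interval 60 days',          -- commit history, default 30 days
  delta.deletedFileRetentionDuration = 'interval 30 days'   -- removed data files, default 7 days
)
""")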
If you perform the following test, you can see that you have history for more than 10 versions:
df = spark.range(10)
for i in range(20):
    df.write.mode("append").format("delta").save("/tmp/dtest")
    # uncomment if you want to see the content of the log after each operation
    # print(dbutils.fs.ls("/tmp/dtest/_delta_log/"))
Then check the files in the log - you should see both checkpoints and files for individual transactions:
%fs ls /tmp/dtest/_delta_log/
also check the history - you should have at least 20 versions:
%sql
describe history delta.`/tmp/dtest/`
and you should be able to go to the early version:
%sql
SELECT * FROM delta.`/tmp/dtest/` VERSION AS OF 1
If you want to keep your checkpoints for X days, you can set delta.checkpointRetentionDuration to X days this way:
spark.sql(f"""
ALTER TABLE delta.`path`
SET TBLPROPERTIES (
delta.checkpointRetentionDuration = 'X days'
)
"""
)

Retaining Delta Lake transaction log data forever

I have a small confusion about the transaction log of Delta Lake. In the documentation it is mentioned that the default retention policy is 30 days and can be modified via the property delta.logRetentionDuration=interval-string.
But I don't understand when the actual log files are deleted from the _delta_log folder. Is it when we run some operation, maybe the VACUUM operation? However, it is mentioned that the VACUUM operation only deletes data files and not logs. But will it delete logs older than the specified log retention duration?
Reference: https://docs.databricks.com/delta/delta-batch.html#data-retention
delta-io/delta PROTOCOL.md:
By default, the reference implementation creates a checkpoint every 10 commits.
There is an async process that runs for every 10th commit to the _delta_log folder. It will create a checkpoint file and will clean up the .crc and .json files that are older than the delta.logRetentionDuration.
Checkpoints.scala has checkpoint > checkpointAndCleanupDeltaLog > doLogCleanup. MetadataCleanup.scala has doLogCleanup > cleanUpExpiredLogs.
The value of the option is an interval literal. There is no way to specify a literal "infinite", and months and years are not allowed for this particular option (for a reason). However, nothing stops you from saying interval 1000000000 weeks - 19 million years is effectively infinite.
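A minimal sketch of applying that, assuming a placeholder table path (not part of the original answer):
# Sketch: make the transaction log retention effectively infinite for a hypothetical table.
spark.sql("""
ALTER TABLE delta.`/tmp/dtest`
SET TBLPROPERTIES (
  delta.logRetentionDuration = 'interval 1000000000 weeks'
)
""")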

cassandra 2.0.6, commitlog directory 215G for 112G Data

280K ./saved_caches
112G ./data
215G ./commitlog
326G .
How can this be explained? Is there a bug in cassandra 2.0.6?
It seems I had accidentally disabled commitlog_total_space_in_mb. I re-enabled it and restarted Cassandra. It took a long time due to the huge log files, but it started and cleaned up the extra log files.
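For reference, the relevant setting lives in cassandra.yaml; a sketch (the value is illustrative, not taken from the post):
# cassandra.yaml - total disk space cap for the commitlog; once it is exceeded,
# Cassandra flushes the oldest dirty memtables so old segments can be removed.
# 8192 MB is only an illustrative value.
commitlog_total_space_in_mb: 8192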

In Cassandra 1.2 - CQL 3 is it possible to abort a secondary index build?

I'd been using a 6GB dataset, with each source record being ~1KB in length, when I accidentally added an index on a column that I am pretty sure has 100% cardinality.
I tried dropping the index from cqlsh, but by that point the two-node cluster had gone into a runaway death spiral, with the load average surpassing 20 on each node, and cqlsh hung on the drop command for 30 minutes. Since this was just a test setup, I shut down and destroyed the cluster and restarted.
This is a fairly disconcerting problem as it makes me fear a scenario where a junior developer is on a production cluster and they set an index on a similar high cardinality column. I scanned through the documentation and looked at the options in nodetool but there didn't seem to be anything along the lines of "abort job or abort building index".
Test environment:
2x m1.xlarge EC2 instances with 2 Raid 0 ephemeral disks
Dataset was 6GB, 1KB per record.
My question in summary: is it possible to abort the process of building a secondary index, and/or to stop/postpone running builds (indexing, compaction) until a later date?
nodetool -h node_address stop index_build
See: http://www.datastax.com/docs/1.2/references/nodetool#nodetool-stop
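For the compaction half of the question, the same nodetool stop subcommand accepts other operation types; a sketch (the node address is a placeholder):
nodetool -h node_address compactionstats    # see what is currently running
nodetool -h node_address stop COMPACTION    # stop the currently running compactions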

How to update Sphinx main and delta indexes

I've read the Sphinx documentation and various resources, but I am confused about the process of maintaining main and delta indexes. Please let me know if this is correct:
Have a table that partitions the search index by last_update_time (NOT id as in the tutorial http://sphinxsearch.com/docs/1.10/delta-updates.html)
Update the delta index every 15 minutes. The delta index only grabs records that have been updated after last_update_time:
indexer --rotate --config /opt/sphinx/etc/sphinx.conf delta
Update the main index every hour by merging delta using:
indexer --merge main delta --merge-dst-range deleted 0 0 --rotate
The pre-query SQL will update last_update_time to NOW(), which re-partitions the indexes.
Confusion: will the merge run the pre-query SQL?
After the main index is updated, immediately update the delta index to clean it up:
indexer --rotate --config /opt/sphinx/etc/sphinx.conf delta
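Put together, the schedule above could look roughly like this crontab sketch (paths mirror those already shown; the minute offsets are just one way to keep the two jobs from overlapping):
# Hypothetical crontab for the schedule described above
5,20,35,50 * * * *  indexer --rotate --config /opt/sphinx/etc/sphinx.conf delta
# hourly: merge delta into main, then immediately rebuild delta to clean it up
0 * * * *           indexer --rotate --config /opt/sphinx/etc/sphinx.conf --merge main delta --merge-dst-range deleted 0 0 && indexer --rotate --config /opt/sphinx/etc/sphinx.conf delta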
EDIT: How would deletion of records even work? Since the delta index would contain deleted records, records would only be removed from search queries after the delta index was merged into main?
To deal with the deletes you need to take a look at the killlist; it basically defines removal criteria:
http://sphinxsearch.com/docs/manual-1.10.html#conf-sql-query-killlist
In an example I have, we build our main index daily in the early morning, then simply run a delta update (including the killlist) every 5 minutes.
On the merge stuff, I'm not sure as I've never used it.
This is only half of the job. Deleted records must be taken care of by the kill list (now called kbatch), and then the delta index will not show the deleted results. But if you merge, they will reappear. To fix this, you have to run:
indexer --merge main delta --merge-dst-range deleted 0 0 --rotate
But in order for this to work, you need an attribute "deleted" added to every record that was deleted. The merge process will then filter out results that have deleted=1, and the main index will not contain deleted results.
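A hypothetical sphinx.conf fragment illustrating both pieces (the table, column, and counter names are made up for the example):
source delta : main
{
    # hypothetical query: pick up rows changed since the last main rebuild,
    # including a 0/1 "deleted" flag used by --merge-dst-range deleted 0 0
    sql_query = \
        SELECT id, title, body, deleted FROM documents \
        WHERE last_update_time > (SELECT last_reindex FROM sph_counter WHERE counter_id = 1)

    # kill-list: suppress these document ids in the main index at search time
    sql_query_killlist = \
        SELECT id FROM documents \
        WHERE last_update_time > (SELECT last_reindex FROM sph_counter WHERE counter_id = 1)

    sql_attr_uint = deleted
}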
