Reinserting data after row deletion in Cassandra using Pelops

I am trying to re-insert data for the same row key after deleting the row, but it is not getting inserted, and no exception is thrown.
I am using the Pelops RowDeletor to delete the row data (note that after deleting, the row key is still shown, with no columns). If I truncate the column family and re-insert, the columns do get inserted.
I have tried changing the consistency level from ANY to ONE to ALL. Any ideas as to what the problem is, or should I switch to the Hector client?

This can be an issue with tombstones (keys without columns) if the timestamp on your column is in the past. Make sure this is not the case, and you should be able to insert. Note that this is not an issue with Pelops but is related to Cassandra's conflict resolution: if you have a tombstone that is newer than the insert, you will see this behaviour, because Cassandra treats the delete as having happened after the insert.
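If the timestamp your client attaches to the new write is not newer than the tombstone left by the RowDeletor, the re-insert is silently shadowed. A minimal sketch of the idea, shown with CQL and the DataStax Java driver rather than Pelops (keyspace, table, and column names are made up for illustration), forcing an explicit write timestamp that is newer than the delete:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class ReinsertAfterDelete {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Hypothetical schema: demo.users(user_id text PRIMARY KEY, name text)
            // Write timestamps are in microseconds; make sure this one is newer than
            // the tombstone created by the earlier delete, or the insert is shadowed.
            long writeTimestampMicros = System.currentTimeMillis() * 1000L;
            session.execute(
                "INSERT INTO demo.users (user_id, name) VALUES ('row-1', 'Alice') "
                + "USING TIMESTAMP " + writeTimestampMicros);
        }
    }
}
```

With Pelops itself, the equivalent fix is to make sure the mutation timestamp used for the re-insert comes from a clock that is not behind the one used for the delete.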

Related

Deleting column in cassandra for large dataset

We have a redundant column that we'd like to delete from our Cassandra database (version 2.1.15). It is a text column that represents the majority of the data on disk (15 nodes × 1.8 TB per node).
The easiest option seems to be an ALTER TABLE to remove that column and then let Cassandra compaction take care of things (we also run Cassandra Reaper to manage repairs). However, given the size of the dataset, I'm concerned I will knock over the cluster with a massive delete.
Another option I've considered is a process that runs through the keyspace setting the value to null, which I think will have the same effect as removing the column but is more under our control (it also requires writing something to do this).
Would anyone have any advice on how to approach this?
Thanks!
Dropping a column does mark the deleted values as tombstones. The column value becomes unavailable immediately, and the column data is removed in the next compaction cycle.
If you want to expedite the removal of the column before that compaction occurs, you can run nodetool upgradesstables to remove the data after you use the ALTER TABLE command to change the metadata for the column.
See Documentation: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/alter_table_r.html
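A minimal sketch of that sequence (keyspace, table, and column names are placeholders; the nodetool step is run from the shell on each node and is shown only as a comment):

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class DropLargeColumn {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Hypothetical schema: my_keyspace.my_table with a large text column "payload".
            // Dropping the column only changes the table metadata; the on-disk data
            // is filtered out of reads and physically removed later.
            session.execute("ALTER TABLE my_keyspace.my_table DROP payload");

            // To reclaim disk space before normal compaction gets to it, rewrite the
            // SSTables on each node from the command line, e.g.:
            //   nodetool upgradesstables -a my_keyspace my_table
            // (-a also rewrites SSTables that are already on the current format)
        }
    }
}
```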
If I remember correctly, dropping a column doesn't actually mark the deleted values with tombstones; instead it inserts a corresponding entry into the system.dropped_columns table, and code such as SerializationHelper and BTreeRow then filters the dropped values out on the fly. The data will be deleted when compaction happens.
Explicitly setting the value to null won't make the situation better, because you'll just be adding data to the table.
I would recommend testing the deletion on a small cluster and checking how it behaves.

Update the column I searched for

Is there any way to update a column value in Cassandra that I searched for (it is part of my primary key)?
I have a (huge) list of items with a field called "LastUpdateDateTime", and from time to time I search for columns that haven't been updated for a while.
The reason I search for these columns is that I want to update them, and after I update them I want to set the timestamp to the current date.
How can I do this with Cassandra?
You can't update a primary key column; doing so will insert another record.
That's how Cassandra works.
You may have to use the spark-cassandra-connector, or delete the records with the old values and insert new ones (see the sketch below).
Note: deleting and inserting is not recommended if you have many records, as it will create a corresponding number of tombstones.
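A minimal sketch of the delete-then-insert pattern in CQL via the Java driver (table and column names are hypothetical; remember that the delete leaves a tombstone behind):

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class ChangeKeyValue {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Hypothetical schema: demo.items(item_id text PRIMARY KEY, payload text)
            // 1. Remove the row stored under the old key value (this writes a tombstone).
            session.execute("DELETE FROM demo.items WHERE item_id = 'old-key'");
            // 2. Re-insert the same data under the new key value.
            session.execute(
                "INSERT INTO demo.items (item_id, payload) VALUES ('new-key', 'same data')");
        }
    }
}
```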

Why doesn't an upsert create tombstones in Cassandra?

As per the question regarding tombstones, why don't upserts create tombstones?
As per the DataStax documentation ("How is data updated?"), Cassandra treats every upsert as a delete followed by an insert, because the new timestamp of the insert overwrites the old timestamp. The data with the old timestamp has to be marked as deleted, which suggests a tombstone.
Why do we have contradicting statements? Or am I missing something here?
Use case:
Data is inserted with a unique key (UUID) in Cassandra, and some of the columns in this data keep updating frequently. Which approach do you recommend?
1. Inserting the same data with new column values in the INSERT query.
2. Updating the existing record, based on the given UUID, with new column values in the UPDATE query.
Which approach does or doesn't create tombstones, and how does Cassandra handle both queries?
As Russ pointed out, you may want to read other similar questions on this topic. However,
An upsert/overwrite is just-another-cell, with a name, a timestamp and a value.
A tombstone is just like an overwrite, except it gets one extra field indicating that it's been deleted, so that it isn't returned as valid output. The reason tombstones are often harmful is that they can accumulate in bad data models, even when people think the data is gone - and skipping them to get to live data actually requires memory.
When you update/upsert as you describe, the cell you create SHADOWS (obsoletes) the previous cell, which will be removed upon compaction. That previous cell is NOT a tombstone, even though it's no longer live/active - it will be compacted away and completely replaced by the new, live, highest-timestamp value as soon as compaction allows.
The biggest thing to keep in mind is this: tombstones aren't necessarily removed by compaction - they're kept around (persisted/rewritten) for at least gc_grace_seconds, and potentially even longer if they need to shadow/cover other cells in sstables that have not yet been compacted. Because of this, tombstones stay around for a long time, but shadowed/overwritten cells are gc'd as soon as the sstable they're in is compacted.
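To make the distinction concrete, here is a small sketch against a hypothetical demo.events table: both the repeated INSERT and the UPDATE just write a newer cell that shadows the older one, and only the explicit DELETE produces a tombstone.

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class UpsertVsTombstone {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Hypothetical schema: demo.events(id int PRIMARY KEY, status text)
            // Upserts: each statement writes a newer cell for "status"; the older cell
            // is shadowed and compacted away. No tombstone is created.
            session.execute("INSERT INTO demo.events (id, status) VALUES (1, 'open')");
            session.execute("UPDATE demo.events SET status = 'closed' WHERE id = 1");

            // Only a delete (or a TTL expiry, or explicitly writing null) creates a tombstone.
            session.execute("DELETE FROM demo.events WHERE id = 1");
        }
    }
}
```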

Overwrite row in cassandra with INSERT, will it cause tombstone?

Writing data to Cassandra without causing it to create tombstones is vital in our case, due to the amount of data and the required speed. Currently we have only written a row once and then never needed to update it again, only fetch the data.
Now there is a case where we actually need to write data and then complete it with more data that becomes available after a while.
This can be done by either:
1. overwriting all of the data in the row again using INSERT (all the data is available), or
2. performing an UPDATE with only the new data.
What is the best way to do it, bearing in mind that speed matters and that not creating tombstones is important?
Tombstones will only be created when deleting data or using TTL values.
Cassandra aligns very well with your described use case. Incrementally adding data will work for both INSERT and UPDATE statements. Cassandra will store the data in different locations when data is added over time for the same partition key. Periodically running compactions will merge the data for a single key again, to optimize access and free disk space. This happens based on the timestamps of the written values but does not create any new tombstones.
You can learn more about how Cassandra stores data e.g. here.
It would be more efficient to do an update to add new or changed data. There is no need to rewrite the old data that isn't changing and it would be inefficient to make Cassandra rewrite it.
When you do an insert or update, Cassandra keeps a timestamp for the modify time for each column. When you do a read, Cassandra collects all the writes for that key from in memory, from on disk, and from other replicas depending on the consistency setting. It will then merge the column data so that the newest value is used for each column.
When data is compacted on disk, if there are separate updates for different columns of a row, those will be combined into a single row in the compacted data.
You don't need to worry about creating tombstones by doing an update unless you are using an update to set a TTL (Time To Live) value. In your application it sounds like you never delete data, so you will never have any tombstones.
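A sketch of the two options against a hypothetical demo.documents table; the UPDATE touches only the column that changed, so Cassandra does not rewrite the untouched columns, and no tombstone is created either way:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class CompleteRowLater {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Hypothetical schema: demo.documents(doc_id text PRIMARY KEY, body text, summary text)
            // Initial write: only part of the data is known at this point.
            session.execute(
                "INSERT INTO demo.documents (doc_id, body) VALUES ('doc-1', 'initial body')");

            // Later, when the remaining data is ready, write just the new column.
            // The partition is merged by timestamp at read/compaction time.
            session.execute(
                "UPDATE demo.documents SET summary = 'finished summary' WHERE doc_id = 'doc-1'");
        }
    }
}
```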

When I remove rows in Cassandra, only the columns are deleted, not the row keys

If I delete every key in a ColumnFamily in a Cassandra DB using remove(key), and then use get_range_slices, the rows are still there but without columns. How can I remove entire rows?
Why do deleted keys show up during range scans?
Because get_range_slice says, "apply this predicate to the range of rows given," meaning, if the predicate result is empty, we have to include an empty result for that row key. It is perfectly valid to perform such a query returning empty column lists for some or all keys, even if no deletions have been performed.
Cassandra uses Distributed Deletes as expected.
Thus, a delete operation can't just wipe out all traces of the data being removed immediately: if we did, and a replica did not receive the delete operation, when it becomes available again it will treat the replicas that did receive the delete as having missed a write update, and repair them! So, instead of wiping out data on delete, Cassandra replaces it with a special value called a tombstone. The tombstone can then be propagated to replicas that missed the initial remove request.
http://wiki.apache.org/cassandra/DistributedDeletes
I've just been having the same issue, and I found that this has been fixed in 0.7 (https://issues.apache.org/jira/browse/CASSANDRA-1027) and backported to 0.6.3.
This is also relevant:
https://issues.apache.org/jira/browse/CASSANDRA-494
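Until you are on a version with those fixes, the usual client-side workaround is simply to skip range-scan results that come back with an empty column list. A rough sketch, assuming a hypothetical column type and a map of row keys to columns returned by your Thrift/Pelops range query (neither is a real Pelops API):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SkipRangeGhosts {

    // Hypothetical column representation; substitute your client's column type.
    static class Column {
        final String name;
        final byte[] value;
        Column(String name, byte[] value) { this.name = name; this.value = value; }
    }

    // Deleted rows show up in range scans as "range ghosts" with an empty column
    // list until compaction removes them; keep only rows that still have columns.
    static Map<String, List<Column>> dropEmptyRows(Map<String, List<Column>> rangeSlices) {
        Map<String, List<Column>> live = new LinkedHashMap<>();
        for (Map.Entry<String, List<Column>> row : rangeSlices.entrySet()) {
            if (row.getValue() != null && !row.getValue().isEmpty()) {
                live.put(row.getKey(), row.getValue());
            }
        }
        return live;
    }
}
```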
