High number of tombstones with TTL columns in Cassandra - cassandra

I have a cassandra Column Family, or CQL table with the following schema:
CREATE TABLE user_actions (
company_id varchar,
employee_id varchar,
inserted_at timeuuid,
action_type varchar,
PRIMARY KEY ((company_id, employee_id), inserted_at)
) WITH CLUSTERING ORDER BY (inserted_at DESC);
Basically, it has a composite partition key made up of a company ID and an employee ID, plus a clustering column representing the insertion time, which orders the columns in reverse chronological order (the newest actions are at the beginning of the row).
Here's what an insert looks like:
INSERT INTO user_actions (company_id, employee_id, inserted_at, action_type)
VALUES ('acme', 'xyz', now(), 'started_project')
USING TTL 1209600; // two weeks
Nothing special here, except the TTL which is set to expire in two weeks.
The read path is also quite simple - we always want the latest 100 actions, so it looks like this:
SELECT action_type FROM user_actions
WHERE company_id = 'acme' and employee_id = 'xyz'
LIMIT 100;
The issue: since we order in reverse chronological order, and the TTL is always the same number of seconds at insertion, I would expect such a query not to scan through any tombstones, because all "dead" columns are at the tail of the row, not the head. But in practice we see many warnings in the log in the following format:
WARN [ReadStage:60452] 2014-09-08 09:48:51,259 SliceQueryFilter.java (line 225) Read 40 live and 1164 tombstoned cells in profiles.user_actions (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=1410169639669000, localDeletion=1410169639}
and on rare occasions the tombstone number is large enough to abort the query completely.
Since I see this type of schema design being advocated quite often, I wonder if I'm doing something wrong here?

Your SELECT statement is not giving an explicit sort order and is hence defaulting to ASC (even though your clustering order is DESC).
So if you change your query to:
SELECT action_type FROM user_actions
WHERE company_id = 'acme' and employee_id = 'xyz'
ORDER BY inserted_at DESC
LIMIT 100;
you should be fine

Perhaps data is reappearing because a node fails and gc_grace_seconds expired already, the node comes back into the cluster, and Cassandra can't replay/repair updates because the tombstone disappeared after gc_grace_seconds: http://www.datastax.com/documentation/cassandra/2.1/cassandra/dml/dml_about_deletes_c.html
The 2.1 incremental repair sounds like it might be right for you: http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_repair_nodes_c.html


YCQL Secondary indexes on tables with TTL in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
I have a table with TTL and a secondary index, using YugabyteDB 2.9.0 and I’m getting the following error when I try to insert a row:
SyntaxException: Feature Not Supported
Below is my schema:
CREATE TABLE lists.list_table (
item_value text,
list_id uuid,
created_at timestamp,
updated_at timestamp,
is_deleted boolean,
valid_from timestamp,
valid_till timestamp,
metadata jsonb,
PRIMARY KEY ((item_value, list_id))
) WITH default_time_to_live = 0
AND transactions = {'enabled': 'true'};
CREATE INDEX list_created_at_idx ON lists.list_table (list_id, created_at)
WITH transactions = {'enabled': 'true'};
We have two types of queries (80% & 20% distribution):
select * from list_table where list_id= <id> and item_value = <value>
select * from list_table where list_id= <id> and created_at>= <created_at>
We expect per list_id there would be around 1000-10000 entries.
The TTL would be around 1 month.
This is a known restriction: transactionally expiring rows via TTL from an indexed table (that is, atomically expiring the TTL entries in both the table and the index) is currently not supported. There are several workarounds:
a) In YCQL, we also support an index with a weaker consistency. This is not well documented today, but you can see the details here: https://github.com/YugaByte/yugabyte-db/issues/1696
The main caveat with this variant of index is error handling: if an INSERT fails, it is the application's responsibility to retry it. As noted in the above issue: << If an insert/update or batch of such operations fails, it is the app's responsibility to retry the operation so that the index is consistent. Much like in a 2-table case, it would have been the apps responsibility to retry (in case of a failure between the update to the two tables) to make sure both tables are in sync again. >>
This type of index supports a TTL at both the table and the index level (keeping the two the same is recommended): https://github.com/yugabyte/yugabyte-db/issues/2481#issuecomment-537177471
b) Another workaround is to use a background cleanup job to periodically delete stale records (instead of using TTL).
c) Avoid using indexes and store the data in two tables: one organized by the original primary key and one organized by the index columns you wanted (as its primary key). Both tables can have a TTL, but it is the application's responsibility to INSERT into both tables when data is added to the database.
d) Another workaround is to avoid the index altogether and pick the PK to be ((list_id, item_value), created_at).
This would not affect the performance of Q1, because with list_id and item_value provided it can use the PK to find the rows. But it would be slower for Q2, where list_id and created_at are provided: while it can still use list_id, it must filter the data on the created_at value without the help of an index. So if Q2 is really 20% of your queries, you probably do not want to scan 1k to 10k items to find your matching rows.
To clarify option (c), with the example in mind:
The first table's PK would be ((list_id, item_value)), the same as your current main table. Instead of an index you'll have a second table, whose PK would be ((list_id), created_at).
Both tables would have a TTL.
The application would have to insert entries into both tables.
In the second table you have a choice:
(option 1) Duplicate all the columns from the main table, including your JSON columns etc. This makes the Q2 lookup fast because the row has everything it needs, but it increases your storage requirements.
(option 2) In addition to the primary key, store just the item_value column in the second table. For Q2, you must first look up the second table to get the item_value, and then use list_id and item_value to retrieve the data from the main table (much like an index would do under the covers).
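To make option (c) with option 2 more concrete, here is a minimal in-memory sketch in Python. It is not YugabyteDB client code: the class name `TwoTableStore` and its methods are hypothetical, the two dictionaries only stand in for the two tables, and TTL expiry is not modeled. It just illustrates the dual write and the two-step Q2 lookup.

```python
from collections import defaultdict


class TwoTableStore:
    """In-memory stand-in for the two tables of option (c), option 2."""

    def __init__(self):
        # Main table, keyed like PK ((list_id, item_value)).
        self.main = {}
        # Second table, keyed like PK ((list_id), created_at); it stores
        # only item_value next to the key columns.
        self.by_created_at = defaultdict(dict)

    def insert(self, list_id, item_value, created_at, metadata):
        # The application writes both tables itself. If the second write
        # fails, the caller has to retry so the tables stay in sync.
        self.main[(list_id, item_value)] = {
            "list_id": list_id,
            "item_value": item_value,
            "created_at": created_at,
            "metadata": metadata,
        }
        self.by_created_at[list_id][created_at] = item_value

    def q1(self, list_id, item_value):
        # Q1: point lookup by the main table's primary key.
        return self.main.get((list_id, item_value))

    def q2(self, list_id, created_at_min):
        # Q2: read the second table first, then fetch each full row from
        # the main table, much like an index lookup would.
        partition = self.by_created_at.get(list_id, {})
        hits = sorted(ts for ts in partition if ts >= created_at_min)
        return [self.main[(list_id, partition[ts])] for ts in hits]


store = TwoTableStore()
store.insert("list-1", "apple", 10, {"color": "red"})
store.insert("list-1", "pear", 20, {"color": "green"})
print([row["item_value"] for row in store.q2("list-1", 15)])
```

With option 1 the second dictionary would hold the full row instead of just `item_value`, and `q2` would skip the second lookup at the cost of storing every row twice.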

Cassandra select query giving timeout error with Gocql driver

I am getting a timeout error when executing more than 2000 SELECT queries simultaneously. I am using the gocql client with Cassandra 3.7 (Java 8).
"error":"gocql: no response received from cassandra within timeout period"...
I have the following table schema:
CREATE TABLE my_db.my_message (
id text,
message_id uuid,
message text,
version text,
status tinyint,
PRIMARY KEY (id, message_id)
);
CREATE INDEX IF NOT EXISTS ON my_db.my_message(status);
Below is my query that gives timeout error when executing more than 2000 queries simultaneously.
"SELECT * FROM my_db.my_message WHERE id=? AND status = ?"
'id' is the partition key, and 'status' has a secondary index; both are used in the WHERE clause.
'message_id' is the clustering column, so it is also part of the primary key, but it is not used in this SELECT query.
Any help would be appreciated. Thanks in advance.
Do not use an index on a frequently updated or deleted column.
Remember when not to use an index:
On high-cardinality columns for a query of a huge volume of records for a small number of results
In tables that use a counter column.
On a frequently updated or deleted column
To look for a row in a large partition unless narrowly queried
I think your status column has low cardinality and is frequently updated. Since you narrow your search by providing the partition key id, low cardinality is not a problem for you. The main problem is that you frequently update the indexed column status. Every time you update it, Cassandra stores a tombstone.
Cassandra stores tombstones in the index until the tombstone limit reaches 100K cells. After exceeding the tombstone limit, the query that uses the indexed value will fail.
So you should filter data by status column value in the application layer.
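Filtering in the application layer could look roughly like the following Python sketch. It assumes the rows were already fetched by partition key only (for example with `SELECT * FROM my_db.my_message WHERE id = ?`) and are represented as dictionaries; the function name `filter_by_status` is hypothetical and no driver calls are shown.

```python
def filter_by_status(rows, status):
    """Keep only the rows whose status column equals the wanted value.

    `rows` is assumed to be the result of a query restricted to a single
    partition, so this loop replaces the secondary-index lookup.
    """
    return [row for row in rows if row["status"] == status]


# Hypothetical rows from one partition of my_db.my_message.
partition_rows = [
    {"id": "user-1", "message_id": "m1", "status": 0},
    {"id": "user-1", "message_id": "m2", "status": 1},
    {"id": "user-1", "message_id": "m3", "status": 1},
]
print([row["message_id"] for row in filter_by_status(partition_rows, 1)])
```

This trades a little client-side work for not reading the index at all, which is reasonable as long as a single partition stays small enough to fetch in one go.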

Cassandra sorting the results by non-clustering key

Our use case with Cassandra is to show top 10 recent visitors of a blogpost. Following is the Cassandra table definition
CREATE TABLE blogs_by_visitor (
blogposturl text,
visitor text,
visited_ts timestamp,
PRIMARY KEY (blogposturl, visitor)
);
Now, in order to show the top 10 recent visitors for a given blogpost, there needs to be an explicit "order by" clause on the timestamp, descending. Since visited_ts isn't a clustering column in Cassandra, we aren't able to do this. The reason visited_ts is not a clustering column is to avoid recording repeat (read: duplicate) visitors. The primary key is designed so that a repeat visitor upserts the latest timestamp.
In RDBMS world the query would look like the following and a secondary index could be created with blogposturl and timestamp columns.
Select visitor from blog_table
where blogposturl = ?
and rownum <= 10
order by timestamp desc
The alternative we currently use in our Cassandra application is to fetch the results and then sort them by timestamp on the app side. But what if a particular blogpost becomes so popular that it has more than 100,000 visitors? The query becomes really slow for those blogs.
I'm thinking a secondary index wouldn't be useful here, as I'm not concerned with filtering on it (only with sorting, which isn't possible).
Any idea on how we could model the table differently?
The actual table has additional columns; I reduced it here for simplicity.
This type of job is usually done with Apache Spark or Hadoop: a scheduled job computes the unique visitors ordered by timestamp for each URL and stores the result in Cassandra.
Or you can create a Materialized View on top of blogs_by_visitor. The base table ensures that visitors are unique, and the materialized view orders the results by the visited_ts timestamp.
Let's create the Materialized View:
CREATE MATERIALIZED VIEW unique_visitor AS
SELECT *
FROM blogs_by_visitor
WHERE blogposturl IS NOT NULL AND visitor IS NOT NULL AND visited_ts IS NOT NULL
PRIMARY KEY (blogposturl, visited_ts, visitor)
WITH CLUSTERING ORDER BY (visited_ts DESC, visitor ASC);
Now you can just select the 10 most recent unique visitors of a blogpost:
SELECT * FROM unique_visitor WHERE blogposturl = ? LIMIT 10;
You can see that I haven't specified the sort order in the select query, because the materialized view schema already specifies the default sort order visited_ts DESC.
Note: the above schema will generate a huge number of unexpected tombstones in the materialized view, because every repeat visit changes visited_ts, which is part of the view's primary key, so the old view row is deleted and a new one is written.
Or you could change your table schema as below:
CREATE TABLE blogs_by_visitor (
blogposturl text,
year int,
month int,
day int,
visitor text,
visited_ts timestamp,
PRIMARY KEY ((blogposturl, year, month, day), visitor)
);
Now you have only a small amount of data in a single partition, so you can sort all the visitors by visited_ts within that partition on the client side. If you think the number of visitors in a day can be huge, add the hour to the partition key as well.
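A rough Python sketch of the client side of this bucketed design follows. The helper names `day_bucket` and `recent_visitors` are hypothetical, rows are assumed to be dictionaries with `visitor` and `visited_ts` keys, and timestamps are assumed to be in UTC. Because the same visitor can show up in more than one day's partition, the sketch keeps only each visitor's latest visit before picking the newest ones.

```python
import heapq
from datetime import datetime, timezone


def day_bucket(blogposturl, visited_ts):
    """Build the partition key (blogposturl, year, month, day) for a visit."""
    return (blogposturl, visited_ts.year, visited_ts.month, visited_ts.day)


def recent_visitors(rows, limit=10):
    """Return up to `limit` distinct visitors, most recent visit first.

    `rows` may come from one or several day partitions.
    """
    latest = {}
    for row in rows:
        seen = latest.get(row["visitor"])
        if seen is None or row["visited_ts"] > seen:
            latest[row["visitor"]] = row["visited_ts"]
    newest = heapq.nlargest(limit, latest.items(), key=lambda item: item[1])
    return [visitor for visitor, _ in newest]


def at(hour):
    # Hypothetical sample timestamps on a single day.
    return datetime(2024, 1, 5, hour, tzinfo=timezone.utc)


sample_rows = [
    {"visitor": "alice", "visited_ts": at(10)},
    {"visitor": "bob", "visited_ts": at(11)},
    {"visitor": "alice", "visited_ts": at(12)},
    {"visitor": "carol", "visited_ts": at(9)},
]
print(day_bucket("http://example.com/post", at(10)))
print(recent_visitors(sample_rows, limit=2))
```

If today's partition holds fewer than 10 visitors, the application would read the previous day's partition as well and pass the combined rows to `recent_visitors`.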

Data modelling ( secondary index vs clustering key )

I am trying to understand whether it will be a performance issue if I choose one of the following:
Option 1:
a highly unique column (order_id) as the partition key, with secondary indexes on store_id and status (so I can query on order_id | store_id | status | both store & status, and also, importantly, update status based on order_id)
Option 2:
store_id as the partition key and a highly unique column (order_id) as the clustering key, with a secondary index on status (so that I can filter on status)
(I can query on store_id | store & order_id | store & status, and also update status based on store & order_id)
I would like to know what the performance issues will be in the above scenarios, and which one is the better option. Thank you very much for your help and time.
Option 1 is interesting, but you need to be careful with your indices. See your other question for more information there (especially the bit concerning querying multiple secondary indices at the same time). That may be alleviated with tables purpose built for your index lookups (further discussed below).
The advantage of the highly unique partition key is that data will be more distributed around your cluster. The downside here is that when you perform a request with WHERE store_id = 'foo' all nodes in the cluster need to be queried as there is no limit on the partition key.
Option 2 you must be careful with. If your partition key is just store_id, then every order will be placed within this partition. For each order there will be n columns added to the single row for the store representing each attribute on the order. In regards to data location all orders for a given store will be placed on the same Cassandra node.
In both cases, why not pursue a lookup table for orders by status? This will remove your need for a secondary index on that field, especially given its relatively low cardinality.
CREATE TABLE orders_by_store_id_status (
store_id VARCHAR,
status VARCHAR,
order_id VARCHAR,
... <additional order fields needed to satisfy your query> ...
PRIMARY KEY ((store_id, status), order_id)
);
This would allow you to query for all orders with a given store_id and status.
SELECT * FROM orders_by_store_id_status WHERE store_id = 'foo' AND status = 'open';
The read is fast as the partition key limits the number of nodes we perform the query against.
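One consequence of this lookup table is worth spelling out: status is part of its partition key, so a status change cannot be done with an UPDATE there. The row has to be deleted from the old partition and inserted into the new one. The Python sketch below only builds the statements as (CQL, parameters) pairs; the function name is hypothetical, no driver is involved, and only the key columns are written because the additional order fields are elided above.

```python
def status_change_statements(store_id, order_id, old_status, new_status):
    """Return (cql, params) pairs that move one order between partitions.

    Only the key columns of orders_by_store_id_status are handled; any
    additional order fields would have to be added to the INSERT.
    """
    delete_old = (
        "DELETE FROM orders_by_store_id_status "
        "WHERE store_id = ? AND status = ? AND order_id = ?",
        (store_id, old_status, order_id),
    )
    insert_new = (
        "INSERT INTO orders_by_store_id_status (store_id, status, order_id) "
        "VALUES (?, ?, ?)",
        (store_id, new_status, order_id),
    )
    return [delete_old, insert_new]


for cql, params in status_change_statements("foo", "order-42", "open", "shipped"):
    print(cql, params)
```

The application would run these alongside the status update on its main orders table, whichever of the two options that table follows.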

Using secondary indexes to update rows in Cassandra 2.1

I'm using Cassandra 2.1 and have a model that roughly looks as follows:
CREATE TABLE events (
client_id bigint,
bucket int,
timestamp timeuuid,
ticket_id bigint,
PRIMARY KEY ((client_id, bucket), timestamp)
);
CREATE INDEX events_ticket ON events(ticket_id);
As you can see, I've created a secondary index on ticket_id. This index works OK. The events table contains around 100 million rows, of which only 5 million have a ticket, spread over around 50,000 distinct tickets. So a ticket has, on average, 100 events.
Querying the secondary index works without supplying the partition key, which is convenient in our situation, as the bucket column is sometimes hard to determine beforehand (you would need to know the date of the events; bucket is currently the date).
cqlsh> select * from events where ticket_id = 123;
client_id | bucket | timestamp | ... | ticket_id
(0 rows)
How do I solve the problem when all events of a ticket should be moved to another ticket? I.e. the following query won't work:
cqlsh> UPDATE events SET ticket_id = 321 WHERE ticket_id = 123;
InvalidRequest: code=2200 [Invalid query] message="Non PRIMARY KEY ticket_id found in where clause"
Does this imply secondary indexes cannot be used in UPDATE queries?
What model should I use to support these changes?
First of all, UPDATE and INSERT operations are treated the same in Cassandra. They are colloquially known as "UPSERTs."
Does this imply secondary indexes cannot be used in UPDATE queries?
Correct. You cannot perform an UPSERT in Cassandra without specifying the complete PRIMARY KEY. Even UPSERTs with a partial PRIMARY KEY will not work. And (as you have discovered) UPSERTing by an indexed value does not work, either.
How do I solve the problem when all events of a ticket should be moved to another ticket?
Unfortunately, the only way to accomplish this is to query the keys of each row in events (with a particular ticket_id) and UPSERT ticket_id by those keys. The nice thing is that you don't have to DELETE them first, because ticket_id is not part of the PRIMARY KEY.
What model should I use to support these changes?
I think your best plan here would be to forgo a secondary index altogether, and create a query table to work alongside your events table:
CREATE TABLE eventsbyticketid (
client_id bigint,
bucket int,
timestamp timeuuid,
ticket_id bigint,
PRIMARY KEY ((ticket_id), timestamp)
);
This would allow you to query by ticket_id quickly (to obtain your client_id, bucket, and timestamp). This would give you the information you need to UPSERT the new ticket_id on your events table.
You could also then perform a DELETE by ticket_id (on the eventsbyticketid table). Cassandra does allow a DELETE operation with a partial PRIMARY KEY, as long as you have the full partition key (ticket_id). So removing old ticket_ids from the query table would be easy. And to ensure write atomicity, you could batch the UPSERTs together:
BEGIN BATCH
UPDATE events SET ticket_id = 321 WHERE client_id=2112 AND bucket='2015-04-22 14:53' AND timestamp=4a7e2730-e929-11e4-88c8-21b264d4c94d;
UPDATE eventsbyticketid SET client_id=2112, bucket='2015-04-22 14:53' WHERE ticket_id=321 AND timestamp=4a7e2730-e929-11e4-88c8-21b264d4c94d;
APPLY BATCH;
Which is actually the same as performing:
INSERT INTO events (client_id,bucket,timestamp,ticket_id) VALUES(2112,'2015-04-22 14:53',4a7e2730-e929-11e4-88c8-21b264d4c94d,321);
INSERT INTO eventsbyticketid (client_id,bucket,timestamp,ticket_id) VALUES(2112,'2015-04-22 14:53',4a7e2730-e929-11e4-88c8-21b264d4c94d,321);
Side note: timestamp is actually a (reserved word) data type in Cassandra. This makes it a pretty lousy name for a timeuuid column.
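Putting the steps of this answer together, moving every event from one ticket to another could be planned roughly as in the Python sketch below. It only builds (CQL, parameters) pairs and does not talk to Cassandra; the function name `plan_ticket_move` is hypothetical, and `lookup_rows` is assumed to be the result of `SELECT client_id, bucket, timestamp FROM eventsbyticketid WHERE ticket_id = ?` for the old ticket, represented as dictionaries.

```python
def plan_ticket_move(lookup_rows, old_ticket_id, new_ticket_id):
    """Build the statements that move all events of one ticket to another."""
    statements = []
    for row in lookup_rows:
        # UPSERT the new ticket_id on events using the full primary key.
        statements.append((
            "UPDATE events SET ticket_id = ? "
            "WHERE client_id = ? AND bucket = ? AND timestamp = ?",
            (new_ticket_id, row["client_id"], row["bucket"], row["timestamp"]),
        ))
        # Write the matching row under the new ticket in the query table.
        statements.append((
            "INSERT INTO eventsbyticketid "
            "(ticket_id, timestamp, client_id, bucket) VALUES (?, ?, ?, ?)",
            (new_ticket_id, row["timestamp"], row["client_id"], row["bucket"]),
        ))
    # A DELETE with only the partition key removes the old ticket's rows.
    statements.append((
        "DELETE FROM eventsbyticketid WHERE ticket_id = ?",
        (old_ticket_id,),
    ))
    return statements


# Hypothetical lookup result for ticket 123.
rows = [{"client_id": 2112, "bucket": 20150422, "timestamp": "uuid-1"}]
for cql, params in plan_ticket_move(rows, 123, 321):
    print(cql, params)
```

The statements for each event could then be wrapped in a batch, as described above, so that the two tables are not left out of step.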
You can use the secondary index to query the events for the old ticket, and then use the primary key from those retrieved events to update the events.
I'm not sure why you need to do this manually, seems like something Cassandra should be able to do under the hood.
