Cassandra select query giving timeout error with Gocql driver

I am getting a timeout error when executing more than 2000 SELECT queries simultaneously. I am using the gocql client with Cassandra 3.7 (Java version 8).
"error":"gocql: no response received from cassandra within timeout period"...
I have the following table schema:
CREATE TABLE my_db.my_message (
id text,
message_id uuid,
message text,
version text,
status tinyint,
PRIMARY KEY (id, message_id)
)
CREATE INDEX IF NOT EXISTS ON my_db.my_message(status);
Below is the query that gives the timeout error when more than 2000 queries are executed simultaneously:
"SELECT * FROM my_db.my_message WHERE id=? AND status = ?"
'id' is the partition key and 'status' has a secondary index, both used in the WHERE clause.
'message_id' is also part of the primary key but is not used in this SELECT query.
Any help would be appreciated. Thanks in advance.

Do not use an index on a frequently updated or deleted column.
Remember when not to use an index:
On high-cardinality columns for a query of a huge volume of records for a small number of results
In tables that use a counter column.
On a frequently updated or deleted column
To look for a row in a large partition unless narrowly queried
I think your status column has low cardinality and is frequently updated. Since you narrow your search by providing the partition key id, low cardinality is not a problem for you. The main problem is that you frequently update the indexed column status: every time you update it, Cassandra stores a tombstone in the index.
Cassandra keeps tombstones in the index until the tombstone limit of 100K cells is reached. Once that limit is exceeded, any query that uses the indexed value will fail.
So you should query by partition key only and filter on the status value in the application layer.
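A minimal gocql sketch of that approach, querying by partition key only and filtering on status client-side (the contact point, struct, and function names are assumptions for illustration):
package main

import (
	"log"

	"github.com/gocql/gocql"
)

// message mirrors the columns of my_db.my_message used below.
type message struct {
	id        string
	messageID gocql.UUID
	body      string
	version   string
	status    int8
}

// fetchByIDAndStatus queries by the partition key only (no secondary
// index involved) and filters on status in the application.
func fetchByIDAndStatus(session *gocql.Session, id string, wanted int8) ([]message, error) {
	iter := session.Query(
		`SELECT id, message_id, message, version, status FROM my_db.my_message WHERE id = ?`,
		id,
	).Iter()

	var result []message
	var m message
	for iter.Scan(&m.id, &m.messageID, &m.body, &m.version, &m.status) {
		if m.status == wanted { // filter here instead of in Cassandra
			result = append(result, m)
		}
	}
	return result, iter.Close()
}

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // contact point is an assumption
	cluster.Keyspace = "my_db"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	msgs, err := fetchByIDAndStatus(session, "some-id", 1)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("found %d matching messages", len(msgs))
}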

Related

Select row with highest timestamp

I have a table that stores events:
CREATE TABLE active_events (
event_id VARCHAR,
number VARCHAR,
....
start_time TIMESTAMP,
PRIMARY KEY (event_id, number)
);
Now I want to select the event with the highest start_time. Is that possible? I've tried creating a secondary index, but with no success.
This is the query I've created:
select * from active_events order by start_time limit 1
But the error says ORDER BY is only supported when the partition key is restricted by an EQ or an IN.
Should I create some kind of materialized view? What should I do to execute my query?
This is an anti-pattern in Cassandra. To order the data you need to read all of it and find the highest value, and this requires scanning data on multiple nodes, which will take very long.
A materialized view also won't help much, as ordering only exists inside an individual partition, so you would need to put all your data into a single partition that could become huge and would leave the data imbalanced.
I can only think of the following workaround (see the sketch after this list):
Have an additional table that has all the columns of the original table, but with a fake partition key and no clustering columns.
Do inserts into that table in parallel with the normal inserts, but use a fixed value for that fake partition key, and explicitly set the write timestamp of the record equal to start_time (don't forget to multiply by 1000, as write timestamps use microseconds). The row is then guaranteed to hold the value with the highest timestamp, as Cassandra won't overwrite it with data that has a lower timestamp.
But this doesn't solve the data-skew problem, and all traffic will be handled by a fixed number of nodes equal to the RF.
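A minimal CQL sketch of that workaround; the table name latest_event, the fixed key 'latest', and the sample values are made up for illustration:
CREATE TABLE latest_event (
    fake_key text,
    event_id text,
    number text,
    start_time timestamp,
    PRIMARY KEY (fake_key)
);
-- start_time is 1548857236000 ms since epoch, multiplied by 1000 for USING TIMESTAMP
INSERT INTO latest_event (fake_key, event_id, number, start_time)
VALUES ('latest', 'evt-42', '17', 1548857236000)
USING TIMESTAMP 1548857236000000;
-- this single row always holds the event with the highest start_time
SELECT * FROM latest_event WHERE fake_key = 'latest';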
Another alternative: use another database.
This type of query isn't valid in big data because it requires a full table scan and doesn't scale. It works in traditional relational databases because the dataset is smaller. Imagine you had billions of partitions, each with thousands of rows, spread across hundreds of nodes. A full table scan in a large cluster would take a very long time if it were allowed.
The error:
ORDER BY is only supported when the partition key is restricted by an EQ or an IN
gets returned because you can only sort the results provided (a) the query is restricted to a partition key, and (b) the rows are ordered by a clustering column. You cannot sort the results based on a column that is not part of the clustering key. Cheers!
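For illustration, a query against the table above that does satisfy both conditions (the value 'evt-42' is made up; note it orders by the clustering column number, not start_time, so it does not solve the original problem):
SELECT * FROM active_events
WHERE event_id = 'evt-42' -- (a) partition key restricted by EQ
ORDER BY number DESC      -- (b) ordered by a clustering column
LIMIT 1;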

Can we restrict in Cassandra that a table only has a limited number of records or rows?

Can we restrict in Cassandra that a table only has a limited number of records or rows? If we want to insert a maximum of 20 rows in a table, how do we do that?
Cassandra does not support this kind of operation. This is part of the business logic of your application and should be done at the application level.
No, but you can put a PER PARTITION LIMIT on the query, then periodically issue a delete to create a range tombstone for everything past that range. I.e., with a table like:
CREATE TABLE mytable (
    pkey text,
    ctime timestamp,
    value text,
    PRIMARY KEY ((pkey), ctime)
);
you can SELECT * FROM mytable WHERE pkey = 'mykey' PER PARTITION LIMIT 20, and if the last row returned has a ctime of 1548857236000, you can then DELETE FROM mytable WHERE pkey = 'mykey' AND ctime > 1548857236000. For the most part I'd just issue that delete very infrequently (like once an hour or once a day, depending on load, to keep the partition size down) and use LeveledCompactionStrategy. If there is enough load, include a date component in the partition key, like ((pkey, yyyyMMdd), ctime), to prevent too much tombstone buildup in the partition.
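Spelled out as a runnable sketch, using the example key and cutoff from above:
-- read at most the first 20 rows of the partition
SELECT * FROM mytable WHERE pkey = 'mykey' PER PARTITION LIMIT 20;
-- note the ctime of the last row returned, then drop everything past it
-- with a single range tombstone
DELETE FROM mytable WHERE pkey = 'mykey' AND ctime > 1548857236000;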

Filter on the partition and the clustering key with an additional criteria

I want to filter on a table that has a partition key and a clustering key, with an additional criterion on a regular column. I got the following warning:
InvalidQueryException: Cannot execute this query as it might involve
data filtering and thus may have unpredictable performance. If you
want to execute this query despite the performance unpredictability,
use ALLOW FILTERING
I understand the problem if the partition and the clustering key are not used. In my case, is it a relevant error or can I ignore it?
Here is an example of the table and query:
CREATE TABLE mytable (
    name text,
    id uuid,
    deleted boolean,
    PRIMARY KEY ((name), id)
);
SELECT id FROM mytable WHERE name='myname' AND id='myid' AND deleted=false;
In Cassandra you can't filter on a non-primary-key column unless you create an index on it.
In Cassandra 3.0 or later, filtering on a non-primary-key column is allowed, but with unpredictable performance.
In Cassandra 3.0 or later, if you provide the full primary key (as in your query), you can run the query with ALLOW FILTERING and ignore the warning.
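For the query above, that means just appending the keyword (keeping the question's placeholder values):
SELECT id FROM mytable WHERE name='myname' AND id='myid' AND deleted=false ALLOW FILTERING;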
Otherwise, filter on the client side, or remove the deleted field and create another table:
Instead of updating the field deleted to true, move the data to another table, say mytable_deleted:
CREATE TABLE mytable_deleted (
    name text,
    id uuid,
    PRIMARY KEY (name, id)
);
Now you only have non-deleted data in mytable and deleted data in mytable_deleted.
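A minimal sketch of that move as an atomic logged batch (the uuid value is made up):
BEGIN BATCH
    DELETE FROM mytable WHERE name = 'myname' AND id = 50554d6e-29bb-11e5-b345-feff819cdc9f;
    INSERT INTO mytable_deleted (name, id) VALUES ('myname', 50554d6e-29bb-11e5-b345-feff819cdc9f);
APPLY BATCH;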
or
Create an index on it:
The column deleted is a low-cardinality column, so remember:
A query on an indexed column in a large cluster typically requires collating responses from multiple data partitions. The query response slows down as more machines are added to the cluster. You can avoid a performance hit when looking for a row in a large partition by narrowing the search.
Read more: When not to use an index

High number of tombstones with TTL columns in Cassandra

I have a Cassandra column family, or CQL table, with the following schema:
CREATE TABLE user_actions (
company_id varchar,
employee_id varchar,
inserted_at timeuuid,
action_type varchar,
PRIMARY KEY ((company_id, employee_id), inserted_at)
) WITH CLUSTERING ORDER BY (inserted_at DESC);
Basically a composite partition key that is made up of a company ID and an employee ID, and a clustering column, representing the insertion time, that is used to order the columns in reverse chronological order (newest actions are at the beginning of the row).
Here's what an insert looks like:
INSERT INTO user_actions (company_id, employee_id, inserted_at, action_type)
VALUES ('acme', 'xyz', now(), 'started_project')
USING TTL 1209600; // two weeks
Nothing special here, except the TTL which is set to expire in two weeks.
The read path is also quite simple - we always want the latest 100 actions, so it looks like this:
SELECT action_type FROM user_actions
WHERE company_id = 'acme' and employee_id = 'xyz'
LIMIT 100;
The issue: I would expect that since we order in reverse chronological order, and the TTL is always the same number of seconds at insertion, such a query should never scan through any tombstones: all "dead" columns are at the tail of the row, not the head. But in practice we see many warnings in the log in the following format:
WARN [ReadStage:60452] 2014-09-08 09:48:51,259 SliceQueryFilter.java (line 225) Read 40 live and 1164 tombstoned cells in profiles.user_actions (see tombstone_warn_threshold). 100 columns was requested, slices=[-], delInfo={deletedAt=1410169639669000, localDeletion=1410169639}
and on rare occasions the tombstone number is large enough to abort the query completely.
Since I see this type of schema design being advocated quite often, I wonder if I'm doing something wrong here?
Your SELECT statement is not giving an explicit sort order and is hence defaulting to ASC (even though your clustering order is DESC).
So if you change your query to:
SELECT action_type FROM user_actions
WHERE company_id = 'acme' and employee_id = 'xyz'
ORDER BY inserted_at DESC
LIMIT 100;
you should be fine.
Perhaps data is reappearing because a node failed after gc_grace_seconds had already expired, came back into the cluster, and Cassandra could not replay/repair the deletes because the tombstones had already been dropped after gc_grace_seconds: http://www.datastax.com/documentation/cassandra/2.1/cassandra/dml/dml_about_deletes_c.html
The 2.1 incremental repair sounds like it might be right for you: http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_repair_nodes_c.html
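If you go that route, repair is kicked off per node with nodetool; a minimal sketch of the 2.1 invocation (flags per the linked docs, where incremental repair is combined with parallel repair):
nodetool repair -par -inc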

Cassandra Data Model design for vnodes enabled cluster?

I have recently started working with Cassandra. We have a Cassandra cluster which runs DSE 4.0 and has vnodes enabled. We have tables like this.
Below is my first table:
CREATE TABLE customers (
customer_id int PRIMARY KEY,
last_modified_date timeuuid,
customer_value text
)
The read query pattern on the above table is currently as follows, since we need to get everything from the table and load it into our application memory every x minutes:
select customer_id, customer_value from datakeyspace.customers;
We have a second table like this:
CREATE TABLE client_data (
client_name text PRIMARY KEY,
client_id text,
creation_date timestamp,
is_valid int,
last_modified_date timestamp
)
CREATE INDEX idx_is_valid_clnt_data ON client_data (is_valid);
Right now the above table has 500 records, and all of them have the "is_valid" column set to 1. The read query pattern on this table is as follows, since we need to get everything from it and load it into our application memory every x minutes. The query below returns all 500 records, since everything has is_valid set to 1.
select client_name, client_id from datakeyspace.client_data where is_valid=1;
Since our cluster has vnodes enabled, the above query pattern is not efficient at all, and it takes a lot of time to get the data from Cassandra: around 50 seconds from the cqlsh client. We read from these tables with consistency level QUORUM.
Is there any possibility of improving our data model by using the wide-rows concept or anything else?
Any suggestions will be greatly appreciated.
