Cassandra query using secondary index timed out

I am facing a timeout issue while executing a query on a Cassandra database. We have tried increasing the read timeout fields "read_request_timeout_in_ms" and "range_request_timeout_in_ms" in cassandra.yaml, but the query still times out after 10 seconds.
Is there any way we can increase the timeout value to 1-2 minutes?
Sample Product Table Schema:
- product_id string (primary key)
- product_name string
- created_on timestamp (secondary index)
- updated_on timestamp
Requirement: I want to query all the products that were created on a particular day, using the 'created_on' field.
Sample Query: select * from "Product" where created_on > 1632906232 AND created_on < 1632906232
Note: Query uses the secondary index field in filter.
Environment details: Cassandra database with a 2-node cluster setup.

The underlying problem is that range queries are expensive, which is why they take so long to complete. By the way, it looks like you posted the wrong query, because you have the same value on both sides of the range.
The default timeouts are in place to prevent nodes from getting overloaded by expensive queries and going down. Increasing the server-side timeouts is not the right approach, and in your case it's most likely the client-side timeout that is being triggered.
You need to review your data model and instead create a table partitioned by the creation date so it will perform better. Cheers!
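As a rough sketch (the table and column layout here is an assumption based on the schema in the question), a query-driven design could bucket products by their creation day:
CREATE TABLE product_by_created_day (
created_day date,
created_on timestamp,
product_id text,
product_name text,
updated_on timestamp,
PRIMARY KEY ((created_day), created_on, product_id)
) WITH CLUSTERING ORDER BY (created_on DESC, product_id ASC);

-- all products created on a given day are then served by a single partition read
SELECT * FROM product_by_created_day WHERE created_day = '2021-09-29';
Writes would insert into this table alongside (or instead of) the original one, and the per-day query no longer needs a cluster-wide secondary index scan.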

Related

Cassandra CQL issue: "Batch too large","code":8704

I am getting the below error in a select query.
{"error":{"name":"ResponseError","info":"Represents an error message from the server","message":"Batch too large","code":8704,"coordinator":"10.29.96.106:9042"}}
Ahh, I get it; you're using Dev Center.
If the result is more than 1000 rows, it shows this error.
Yes, that's Dev Center preventing you from running queries that can hurt your cluster. Like this:
select * from user_request_by_country_by_processworkflow
WHERE created_on <= '2022-01-08T16:19:07+05:30' ALLOW FILTERING;
ALLOW FILTERING is a way to force Cassandra to read multiple partitions in one query, even though it is designed to warn you against doing that. If you really need to run a query like this, then you'll want to build a table with a PRIMARY KEY designed specifically to support it.
In this case, I'd recommend "bucketing" your table data by whichever time component keeps the partitions within a reasonable size. For example, if the day keeps the rows-per-partition below 50k, the primary key definition would look like this:
PRIMARY KEY (day,created_on)
WITH CLUSTERING ORDER BY (created_on DESC);
Then, a query that would work and be allowed would look like this:
SELECT * FROM user_request_by_country_by_processworkflow
WHERE day=20220108
AND created_on <= '2022-01-08T16:19:07+05:30';
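Putting those pieces together, a complete bucketed table might look like the sketch below (the non-key columns are assumptions, since the original table definition isn't shown; in practice you may also want an extra clustering column such as a request id to keep rows unique within the same timestamp):
CREATE TABLE user_request_by_day (
day int,
created_on timestamp,
country text, -- assumed payload columns
process_workflow text,
request_payload text,
PRIMARY KEY ((day), created_on)
) WITH CLUSTERING ORDER BY (created_on DESC);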
In summary:
Don't run multi-partition queries.
Don't use ALLOW FILTERING.
Do build tables to match queries.
Do use time buckets to keep partitions from growing unbounded.

Error: deadlock detected in Postgres while using atomic queries

We are using Postgres as part of our backend structure in Node.js (using pg). This is a highly concurrent, multi-process environment with a bunch of microservices, where the services query the same table. There is basically a column 'status' which functions as a lock: its value is either 'pending' for unlocked or 'in_process' for locked.
There are two queries which select data from the table and lock the corresponding rows:
UPDATE market AS m
SET status='in_process', status_update_timestamp='${timestamp}'
WHERE m.guid IN
(SELECT guid FROM market
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1 FOR UPDATE)
RETURNING *
UPDATE market AS m
SET status = 'in_process', status_update_timestamp = '${timestamp}'
WHERE m.guid IN
(SELECT guid FROM market
WHERE status='pending' AND asset_id IN (${myArray.join(",")})
FOR UPDATE)
RETURNING *
And one query which unlocks rows based on guids:
UPDATE market
SET status='pending', status_update_timestamp='${timestamp}'
WHERE guid IN ('${guids.join("','")}')
There are cases where the two selecting queries can block each other, and also cases where the unlocking query and one of the selecting queries block each other.
All of these queries can be executed in parallel from multiple services, and even though they are supposed to be atomic according to the documentation (link), we still get an error from Postgres that a 'deadlock is detected'. We tried wrapping the queries with BEGIN and END, different isolation levels, and different ORDER BYs, but still without any improvement.
Is there any problem in the queries that gives rise to deadlocks? Is this a problem that has to be solved in the application logic? Any help is welcome.
Table structure:
CREATE TABLE market
(
id BIGSERIAL not null constraint market_pkey primary key,
guid UUID DEFAULT uuid_generate_v4(),
asset_id BIGINT,
created_at TIMESTAMP DEFAULT current_timestamp,
status_update_timestamp TIMESTAMP DEFAULT current_timestamp,
status VARCHAR DEFAULT 'pending'
);
"Atomic" doesn't mean "can't fail". It just means that if it does fail, the whole thing gets rolled back completely.
You could solve the problem in the app by catching the deadlock errors and retrying them.
Perhaps you could redesign your transactions to be less prone to deadlock, but without knowing the rationale behind each query it is hard to suggest how you would go about doing that.
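As one illustration of such a redesign (a sketch only, against the table above; whether the changed semantics fit the application is an assumption): selecting the candidate rows in a single fixed order and skipping rows that are already locked usually removes this kind of lock-ordering deadlock in queue-style tables.
UPDATE market AS m
SET status = 'in_process', status_update_timestamp = now()
WHERE m.guid IN
(SELECT guid FROM market
WHERE status = 'pending' AND asset_id IN (1, 2, 3) -- example asset ids
ORDER BY guid -- one consistent locking order in every query
FOR UPDATE SKIP LOCKED) -- don't wait on rows another worker already holds
RETURNING *
With SKIP LOCKED (available since PostgreSQL 9.5) a worker simply doesn't see rows that another worker is currently processing, which is usually what a 'pending'/'in_process' queue wants anyway; the retry-on-deadlock handling suggested above is still a good safety net.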

Need pagination for the following Cassandra table

CREATE TABLE feed (
identifier text,
post_id int,
score int,
reason text,
timestamp timeuuid,
PRIMARY KEY ((identifier, post_id), score, timestamp)
) WITH CLUSTERING ORDER BY (score DESC, timestamp DESC);
CREATE INDEX IF NOT EXISTS index_identifier ON feed ( identifier );
I want to run 2 types of queries: where identifier = 'user_5' and post_id = 11; and where identifier = 'user_5';
I want to paginate with 10 results per query. However, a few queries can have a variable result count, so it would be best if there is something like a *column* > last_record that I can use.
Please help. Thanks in advance.
P.S: Cassandra version - 3.11.6
First, and most important: you're approaching Cassandra like a traditional database that runs on a single node. Your data model doesn't support efficient retrieval of data for your queries, and the secondary index doesn't help much, as the query still needs to reach all nodes to fetch the data, because data is distributed between nodes based on the value of the partition key ((identifier, post_id) in your case). It may work with a small amount of data in a small cluster, but will fail miserably when you scale up.
In Cassandra, all data modelling starts from the queries, so if you're querying by identifier, then it should be the partition key (although you may get problems with big partitions if some users produce a lot of messages). Inside a partition you may use secondary indexes; that shouldn't be a problem. Plus, inside a partition it's easier to organize paging. Cassandra natively supports forward paging, so you just need to keep the paging state between queries. In Java driver 4.6.0, a special helper class was added to support offset paging of results, although it may not be very efficient, as it still needs to read the data from Cassandra to skip to the given page, but at least it's some help. Here is an example from the documentation:
String query = "SELECT ...";
// organize by 20 rows per page
OffsetPager pager = new OffsetPager(20);
// Get page 2: start from a fresh result set, throw away rows 1-20, then return rows 21-40
ResultSet rs = session.execute(query);
OffsetPager.Page<Row> page2 = pager.getPage(rs, 2);
// Get page 5: start from a fresh result set, throw away rows 1-80, then return rows 81-100
rs = session.execute(query);
OffsetPager.Page<Row> page5 = pager.getPage(rs, 5);
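To make the data-modelling advice in the first paragraph concrete, a revised table keyed only by identifier could look like the sketch below (table and index names are made up; if some identifiers accumulate very large partitions, a bucket column would be added to the partition key):
CREATE TABLE feed_by_identifier (
identifier text,
post_id int,
score int,
reason text,
timestamp timeuuid,
PRIMARY KEY ((identifier), score, timestamp, post_id)
) WITH CLUSTERING ORDER BY (score DESC, timestamp DESC, post_id ASC);
CREATE INDEX IF NOT EXISTS feed_post_id_idx ON feed_by_identifier (post_id);

-- both queries now touch a single partition
SELECT * FROM feed_by_identifier WHERE identifier = 'user_5';
SELECT * FROM feed_by_identifier WHERE identifier = 'user_5' AND post_id = 11;
Paging with a fetch size of 10 then works within that partition, either through the driver's paging state or, if offset-style pages are needed, through the OffsetPager helper shown above.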

Two SQL queries show a large time difference in Presto

I deployed a Presto cluster with 2 worker nodes, but two SQL queries show a large difference in execution time.
//sql1: it takes 398.12ms
SELECT count(employee_name) from employee where jobstatus=2;
// sql2: it takes 16.58s
SELECT count(employee_name) from employee where create_time > date_parse('2018-12-20','%Y-%m-%d') and create_time < date_parse('2019-12-20','%Y-%m-%d');
I guess sql2 loads all the data of the employee table into memory for filtering, while sql1 is filtered directly in Oracle. How can I confirm this? Or is there another way to locate the cause?
Presto version is 0.147. employee is an Oracle table with 500,000 rows, of which 36 match jobstatus=2 and 98 match create_time > date_parse('2018-12-20','%Y-%m-%d') AND create_time < date_parse('2019-12-20','%Y-%m-%d'). Neither create_time nor jobstatus is indexed.
There was no concurrency during testing; the queries were executed sequentially.
If you are connecting to Oracle from Presto, then those queries will be single-threaded (a single JDBC connection) on the Presto side, which means only one worker would be active regardless of the cluster size. Hence, whatever performance numbers you are seeing are coming from the Oracle side.
Based on the given performance numbers, it seems the jobstatus column is indexed and create_time is not. Please verify that.
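One way to check where the filtering actually happens (a suggestion beyond the answer above) is to look at the plan Presto produces for the slow statement:
EXPLAIN
SELECT count(employee_name) FROM employee
WHERE create_time > date_parse('2018-12-20','%Y-%m-%d')
AND create_time < date_parse('2019-12-20','%Y-%m-%d');
If the create_time predicate only appears in a filter node above the table scan, Presto is pulling the rows out of Oracle and filtering them itself; if it appears as part of the scan's constraint, it has been pushed down to Oracle. On newer Presto versions, EXPLAIN ANALYZE additionally reports how many rows each stage actually processed.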

Query all and consistency

This is a question regarding the behavior of Cassandra for a select * query.
It's more for understanding; I know that normally I should not execute such a query.
Assume I have 4 nodes with RF=2.
The following table (column family):
create table test_storage (
id text,
created_on TIMESTAMP,
location int,
data text,
PRIMARY KEY(id)
);
I inserted 100 entries into the table.
Now I do a select * from test_storage via cqlsh. Doing the query multiple times I get different results, so not all entries. When changing consistency to local_quorum I always get back the complete result. Why is this so?
I assumed, apart from the performance, that I would also get all entries with consistency ONE, since it must query the whole token range.
Second issue: when I add a secondary index, in this case on location, and do a query like select * from test_storage where location=1, I also get random results with consistency ONE, and always correct results when changing to consistency level LOCAL_QUORUM. Here too I don't understand why this happens.
When changing consistency to local_quorum I always get back the complete result. Why is this so?
Welcome to the eventual consistency world. To understand it, read my slides: http://www.slideshare.net/doanduyhai/cassandra-introduction-2016-60292046/31
I assumed, apart from the performance, that I would also get all entries with consistency ONE, since it must query the whole token range
Yes, Cassandra will query all token ranges because of the unrestricted SELECT *, but at consistency ONE it will only request data from one replica out of 2 (RF=2), so any rows that replica hasn't received yet are simply missing from the result. With LOCAL_QUORUM and RF=2, a quorum is 2, so both replicas are consulted and every row that reached at least one of them is returned.
and do a query like select * from test_storage where location=1 I also get random results with consistency ONE
Same answer as above: the native Cassandra secondary index just uses a Cassandra table under the hood to store the reverse index, so the same eventual consistency rules apply there too.
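For reference, the consistency level in cqlsh is a session setting, so the difference can be reproduced by switching it before re-running the same query:
CONSISTENCY ONE;
SELECT * FROM test_storage; -- reads only one replica per token range, so it can miss rows
CONSISTENCY LOCAL_QUORUM;
SELECT * FROM test_storage; -- with RF=2 a quorum is 2, so both replicas are consulted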
