select count(*) from my_table gives me: OperationTimedOut: errors={}, last_host=127.0.0.1
I have already tried changing request_timeout_in_ms in cassandra.yaml and request_timeout in cqlshrc.sample (both are in C:\Programs\DataStax-DDC\apache-cassandra\conf), but without success.
How can I increase the timeout?
select count(*) is not doing what you think. It is actually expensive, as it counts the rows one by one. You can track the number of records using a separate column family with a counter, which you will need to increment for every insert you do into your table. For example:
CREATE TABLE IF NOT EXISTS my_table_counter (
    mykey text,
    count counter,
    PRIMARY KEY (mykey)
);
Then, for every insert into your table, also update the counter:
INSERT into my_table (mykey, mydata) VALUES (?, ?);
UPDATE my_table_counter SET count = count + 1 WHERE mykey = ?;
To get the count:
SELECT count FROM my_table_counter WHERE mykey = ?;
Note that counters are not idempotent, so in the rare event of a failure your data might be under- or over-counted. Also, the code above assumes that you only insert with new keys.
If you need precise counting, Cassandra may not be a good fit. Also, if you are not inserting with unique keys, you may need to consider using a lightweight transaction on the insert (IF NOT EXISTS) and updating the counter only if the transaction was applied.
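For illustration, a minimal sketch of that approach; whether the counter update runs is decided client-side from the [applied] column that the conditional insert returns:
INSERT INTO my_table (mykey, mydata) VALUES (?, ?) IF NOT EXISTS;
-- the conditional insert returns a row with an [applied] boolean column;
-- run the counter update only when the client sees [applied] = true
UPDATE my_table_counter SET count = count + 1 WHERE mykey = ?;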
Related
I am looking for a good way to store time-specific data in Cassandra.
Each entry can look like (start_time, value). Later, I would like to retrieve the current value.
The logic for retrieving the current value is as follows:
Find all rows with start_time <= current_time.
Then find the value with the maximum start_time among the rows obtained in the first step.
PS: I edited the question to make it clearer.
The exact requirements are not possible to satisfy, but we can get close with one more column.
First, to be able to use the <= operator, your start_time column needs to be a clustering key of your table.
Then, you need a separate partition key. You could choose a fixed value, but that could cause problems once the partition holds too many rows, so it is better to use something like the year or the month of the start_time.
CREATE TABLE time_specific_table (
    year bigint,
    start_time timestamp,
    value text,
    PRIMARY KEY ((year), start_time)
) WITH CLUSTERING ORDER BY (start_time DESC);
The problem is that when you query the table, you need to know the value of the partition key:
Find all rows with start_time <= current_time:
SELECT * FROM time_specific_table
WHERE year = :year AND start_time <= :time;
Select the value with the maximum start_time:
SELECT * FROM time_specific_table
WHERE year = :year LIMIT 1;
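For completeness, a sketch of how an insert might look, assuming the client derives the year from start_time (the values are illustrative):
-- the year is extracted from start_time on the client side
INSERT INTO time_specific_table (year, start_time, value)
VALUES (2017, '2017-02-22 10:00:00+0000', 'some value');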
Create two separate tables like below:
CREATE TABLE data (
    start_time timestamp,
    value int,
    PRIMARY KEY (start_time, value)
);
CREATE TABLE current_value (
    partition int PRIMARY KEY,
    value int
);
Now you have to insert data into both tables; when inserting into the second table, use a fixed partition value like 1:
INSERT INTO current_value(partition, value) VALUES(1, 10);
In the current_value table every insert is an upsert, so you will always get the latest value whenever you select.
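A small sketch of the paired writes and the read, with illustrative values:
-- write each sample to both tables
INSERT INTO data (start_time, value) VALUES ('2017-02-22 10:00:00+0000', 10);
INSERT INTO current_value (partition, value) VALUES (1, 10);
-- the fixed partition key 1 means every insert overwrites the same row,
-- so this always returns the most recently written value
SELECT value FROM current_value WHERE partition = 1;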
I would like to know the performance difference between executing the following two queries on a table cycling.cyclist_points containing thousands of rows:
SELECT sum(race_points)
FROM cycling.cyclist_points
WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac;
select *
from cycling.cyclist_points
WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac;
If sum(race_points) causes the query to be expensive, I will have to look for other solutions.
Performance difference between your queries:
Both of your queries need to scan the same number of rows (the number of rows in that partition).
The first query selects only a single column, so it is a little bit faster.
Instead of calculating the sum at runtime, try to precompute it.
If race_points is an int or bigint, you can use a counter table like below:
CREATE TABLE race_points_counter (
    id uuid PRIMARY KEY,
    sum counter
);
Whenever new data is inserted into cyclist_points, also increment the sum by the current points:
UPDATE race_points_counter SET sum = sum + ? WHERE id = ?;
Now you can just select the sum for that id:
SELECT sum FROM race_points_counter WHERE id = ?;
I am already able to get the last row of a time-series table as:
SELECT * from myapp.locations WHERE organization_id=1 and user_id=15 and date='2017-2-22' ORDER BY unix_time DESC LIMIT 1;
That works fine. However, I am wondering about the performance overhead of executing ORDER BY when the rows are already sorted; I only use it to get the last row, so is it an overhead in my case?
If I don't use ORDER BY, I will always get the first row in the table, so I thought I might be able to use INSERT in another way, e.g. always insert at the beginning of the table instead of at the end.
Any advice? Shall I use ORDER BY without worrying about performance?
Just define your clustering order as DESC, like the schema below:
CREATE TABLE locations (
    organization_id int,
    user_id int,
    date text,
    unix_time bigint,
    lat double,
    long double,
    PRIMARY KEY ((organization_id, user_id, date), unix_time)
) WITH CLUSTERING ORDER BY (unix_time DESC);
By default your data will now be sorted by unix_time DESC, so you don't need to specify the order in the query.
Now you can just use the query below to get the last row:
SELECT * from myapp.locations WHERE organization_id = 1 and user_id = 15 and date = '2017-2-22' LIMIT 1;
If your query pattern for that table is always ORDER BY unix_time DESC, then you are in a reverse-order time-series scenario, and I can say that your model is inaccurate (not wrong).
There's no reason not to store the records in reverse order by adding WITH CLUSTERING ORDER BY (unix_time DESC) to the table definition; an explicit ORDER BY unix_time DESC at query time will perform at most on par with a table modeled for this use case (and, I suspect, worse).
I have a table with a timestamp column, and I'd like to manipulate the values of that column. For instance, I need to do something along the line:
UPDATE mytable SET datetimecolumn = datetimecolumn + 10mins
How is it done in Apache Cassandra?
UPDATE: The answer seems to be "you can't", but the selected answer is apparently the closest we can get.
You can run a query like this one only if the data type is counter.
Using a counter:
A counter is a special column used to store a number that is changed in increments. For example, you might use a counter column to count the number of times a page is viewed.
Define a counter in a dedicated table only and use the counter data type. You cannot index, delete, or re-add a counter column. All non-counter columns in the table must be defined as part of the primary key.
Example:
CREATE TABLE mytable (
    pk1 int PRIMARY KEY,
    datetimecolumn counter
);
Here you have to store the datetimecolumn value in milliseconds.
The first time, you have to issue an update query with the time in milliseconds, let's say 1487686182403:
UPDATE mytable SET datetimecolumn = datetimecolumn + 1487686182403 WHERE pk1 = 1;
Now mytable with pk1 = 1 contains the value datetimecolumn = 1487686182403.
If you want to increment datetimecolumn by 10 minutes (600000 milliseconds):
UPDATE mytable SET datetimecolumn = datetimecolumn + 600000 WHERE pk1 = 1;
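Reading the value back is a plain select; converting the returned milliseconds into a date is left to the client, since the counter itself is just a number:
SELECT datetimecolumn FROM mytable WHERE pk1 = 1;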
Source: https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_counter_t.html
Read-before-write is an anti-pattern in Cassandra. You have to manipulate the value on the client side and update as usual.
In other words: you have to read (select) the value, change it (increment by 10 minutes) on the client, and write the new value back to Cassandra.
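A sketch of those two steps, assuming mytable has a primary key column pk1 and datetimecolumn is a regular timestamp (the written value is illustrative):
-- 1) read the current value
SELECT datetimecolumn FROM mytable WHERE pk1 = 1;
-- 2) add 10 minutes on the client, then write the result back
UPDATE mytable SET datetimecolumn = '2017-02-21 14:29:42+0000' WHERE pk1 = 1;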
I have a column family with primary key definition like this:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
which will be queried using queries such as:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id = 0 AND date > 'some_date' ;
However, I'd like to keep my column family ordered by date only, so that SELECT date FROM myCF; would return the most recently inserted date.
Due to the order of the clustering columns, what I get is an order per user_id, then per date.
If I change the primary key definition to:
PRIMARY KEY ((website_id, item_id), date, user_id)
I can no longer run the same query, as date must be restricted if user_id is.
I thought there might be some way to say:
...
PRIMARY KEY ((website_id, shop_id), store_id, date)
) WITH CLUSTERING ORDER BY (store_id RANDOMPLEASE, date DESC) ;
But that doesn't seem to exist. Worse, maybe this is completely stupid and I don't get why.
Is there any ways of achieving this? Am I missing something?
Many thanks!
Your query example restricts user_id, so it should work with the second table format. But if you are actually trying to run queries like:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND date > 'some_date'
then you need an additional table created to handle those queries; it would order only on date and not on user_id:
Create Table LookupByDate ... PRIMARY KEY ((website_id, item_id), date)
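Spelled out, a sketch of what that table might look like, reusing the column names from the original table (the types are assumptions):
CREATE TABLE LookupByDate (
    website_id int,
    item_id int,
    user_id int,
    date timestamp,
    PRIMARY KEY ((website_id, item_id), date)
) WITH CLUSTERING ORDER BY (date DESC);
-- note: with date as the only clustering column, two users writing at the
-- same date overwrite each other; add user_id as a trailing clustering
-- column if that can happen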
In addition to your primary query: if all you are trying to get is the most recently inserted date, you may not need an additional table. You can use a static column to store the last update time per partition (see CASSANDRA-6561), as sketched below.
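A hedged sketch of that static-column approach (the column name last_date and the types are assumptions):
-- a static column is shared by all rows of the same partition
ALTER TABLE myCF ADD last_date timestamp STATIC;
-- write the latest date alongside each regular insert; assuming inserts
-- arrive in time order, the static cell always holds the newest date
INSERT INTO myCF (website_id, item_id, user_id, date, last_date)
VALUES (30, 10, 0, '2017-02-22 10:00:00+0000', '2017-02-22 10:00:00+0000');
SELECT last_date FROM myCF WHERE website_id = 30 AND item_id = 10 LIMIT 1;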
It probably won't help your particular case (since I imagine your list of all users is unmanageably large), but if the condition on the first clustering column matches one of a relatively small set of values, then you can use IN:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id IN ? AND date > 'some_date'
Don't use IN on the partition key, because it creates an inefficient query that hits multiple nodes and puts stress on the coordinator node; instead, execute multiple asynchronous queries in parallel. IN on a clustering column, however, is absolutely fine.
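In CQL terms, "multiple queries in parallel" just means issuing the equivalent single-partition statements separately and letting the driver run them asynchronously (the second website_id value is illustrative):
-- instead of WHERE website_id IN (30, 31), run each partition's query on its own:
SELECT * FROM myCF WHERE website_id = 30 AND item_id = 10 AND user_id = 0 AND date > 'some_date';
SELECT * FROM myCF WHERE website_id = 31 AND item_id = 10 AND user_id = 0 AND date > 'some_date';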