We are running Apache Cassandra 2.1.X and using the DataStax driver. I have a use case where we need to keep a count of various things. I came up with a schema something like this:
CREATE TABLE count (
partitionKey bigint,
type text,
uniqueId uuid,
PRIMARY KEY (partitionKey, type, uniqueId)
);
So this is nothing but wide rows. My question is, if I do something like
select count(uniqueId) from count where partitionKey=987 and type='someType'
and this returns a count of, say, 150k:
Will it be an expensive operation for Cassandra? Is there a better way to compute a count like this? I also want to know if anyone has solved something like this before.
I would prefer to stay away from keeping a counter, as it's not that accurate, and keeping the count at the application level is doomed to fail anyway.
It would also be great to know how Cassandra computes such a count internally.
A big thanks to folks who help the community!
Even if you specify the partition key, Cassandra still needs to read all 150k cells to give you the count.
If you don't specify the partition key, Cassandra needs to scan every row on every node to give you the count.
The best approach is to use a counter table.
CREATE TABLE id_count (
partitionkey bigint,
type text,
count counter,
PRIMARY KEY ((partitionkey, type))
);
Whenever a uniqueId is inserted, increment the count here.
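To make the trade-off concrete, here is a toy in-memory model in Python. It is not a Cassandra client; the class and method names are hypothetical. It assumes that writing the same primary key twice overwrites the existing row, while the counter is incremented on every write, which is one way a counter can drift from the real row count.

```python
from collections import defaultdict


class CountModel:
    """Toy stand-in for the wide-row table and the counter table (hypothetical)."""

    def __init__(self):
        # (partition_key, type) -> set of unique ids, like the wide-row table
        self.rows = defaultdict(set)
        # (partition_key, type) -> running total, like the counter table
        self.counters = defaultdict(int)

    def insert(self, partition_key, type_, unique_id):
        key = (partition_key, type_)
        self.rows[key].add(unique_id)  # same primary key again is an overwrite
        self.counters[key] += 1        # the counter is bumped on every write

    def count_by_scan(self, partition_key, type_):
        # Touches every stored id, which is what makes a large count expensive.
        return sum(1 for _ in self.rows[(partition_key, type_)])

    def count_by_counter(self, partition_key, type_):
        # Single lookup, but only as accurate as the increments were.
        return self.counters[(partition_key, type_)]
```

If the application can retry or repeat an insert, the two numbers diverge, so the counter approach fits best when each uniqueId is written exactly once.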
I have just begun studying Cassandra.
I had the following table and query.
CREATE TABLE finance.tickdata(
id_symbol int,
ts timestamp,
bid double,
ask double,
PRIMARY KEY(id_symbol,ts)
);
This query succeeds:
select ts,ask,bid
from finance.tickdata
where id_symbol=3
order by ts desc;
Next, it was decided to move id_symbol into the table name. The script for the new table(s):
CREATE TABLE IF NOT EXISTS mts_src.ticks_3(
ts timestamp PRIMARY KEY,
bid double,
ask double
);
And now this query fails:
select * from mts_src.ticks_3 order by ts desc
I read in the docs that I need to filter (WHERE) by the primary key (partition key),
but technically both of my examples are the same. Why is Cassandra so restrictive in this respect?
One more question: is this a good idea in general? Moving id_symbol into the table name means
potentially 1000 unique id_symbol values and a lot of data for each. Separating this data into individual tables looks like a good idea, but I lose the ability to ORDER BY, which I need in order to fetch fresh data for each id_symbol.
Thanks.
You can't sort on the partition key; you can sort only on clustering columns inside a single partition. So you need to model your data accordingly. But you need to be very careful not to create very large partitions (when using ticker_id alone as the partition key, for example). In that case you may need to create a composite key, like ticker_id + year, or month, depending on how often you're inserting data.
Regarding a table per ticker, that's not a very good idea, because every table has overhead, and it will lead to increased resource consumption. 200 tables is already a high number, and 500 is almost a "hard limit".
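A minimal sketch of the composite-key idea on the application side, in Python. It assumes a monthly bucket stored as a "YYYY-MM" string next to the symbol id in the partition key; the function names and the bucket format are illustrative assumptions, not something the answer prescribes.

```python
from datetime import datetime


def month_bucket(ts):
    """Bucket label for a timestamp, e.g. '2024-02' (assumed format)."""
    return f"{ts.year:04d}-{ts.month:02d}"


def partition_key(id_symbol, ts):
    """Composite partition key (id_symbol, bucket) for one tick."""
    return (id_symbol, month_bucket(ts))


def recent_buckets(now, months):
    """Bucket labels to query, newest first, going back `months` months."""
    year, month = now.year, now.month
    buckets = []
    for _ in range(months):
        buckets.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return buckets
```

To read the freshest ticks for a symbol, the application would query the buckets from recent_buckets in order, each with ORDER BY ts DESC inside its own partition, and stop as soon as it has enough rows.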
I need to select the Nth row from a Cassandra table, based on a number I get from my logic. For example, if the logic outputs 23, I need to get the details of the 23rd row. Since there is no auto increment in Cassandra, I can't match on an ID key. In SQL this is done using OFFSET and LIMIT. I don't know how to achieve this feat in Cassandra.
Can we achieve this using a UDF? Any solution would be appreciated. Thanks in advance.
Table Schema :
CREATE TABLE new_card (
customer_id bigint,
card_number text,
active tinyint,
auto_pay int,
available_credit_limit double,
average_card_spend_half_yearly double,
average_card_spend_monthly double,
average_card_spend_quarterly double,
average_card_spend_yearly double,
avg_half_yearly_spend_mcc double,
PRIMARY KEY (customer_id, card_number)
);
If you are using the Java driver, refer to Paging.
Note that Cassandra does not support direct offsets; pages are read sequentially. If you have to use offsets in your queries, you might want to revisit your data model. You could create a composite partition key that includes the row number as an additional column on top of your existing partition key column.
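Because pages are read sequentially, an offset can only be emulated by walking the rows from the start and discarding the ones before it. A minimal Python sketch, assuming `rows` is any iterable that yields rows in query order (for example a driver result set that fetches further pages as it is consumed); `nth_row` is a hypothetical helper name:

```python
from itertools import islice


def nth_row(rows, n):
    """Return the n-th row (1-based) of an iterable, or None if it is too short.

    Every row before the n-th is still read and thrown away, so the cost
    grows with n.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    return next(islice(rows, n - 1, n), None)
```

This only gives a meaningful "23rd row" when the query has a defined order, which in Cassandra means rows inside one partition ordered by clustering column.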
You simply can't select the Nth row from a table, because a Cassandra table is made of partitions, and you can order rows within a partition, but not the partitions themselves. Paging will go through the whole table, but there will be no chronological order to the rows selected with such an approach (disregarding the fact that the partitions can change while you are going through the pages).
If you want to select row number N from Cassandra, you need to implement an auto-increment field at the application level and use it as the key.
There are ways to do it with Cassandra, using lightweight transactions for example, but they have a high cost from a performance perspective. See several solutions here:
How to create auto increment IDs in Cassandra
I have a Cassandra table that is created like:
CREATE TABLE table(
num int,
part_key int,
val1 int,
val2 float,
val3 text,
...,
PRIMARY KEY((part_key), num)
);
part_key is 1 for every record, because I want to execute range queries and I have only one server (I know that's not a good use case). num is the record number from 1 to 1.000.000. I can already run queries like
SELECT num, val43 FROM table WHERE part_key=1 and num<5000;
Is it possible to do some more filtering in Cassandra, like:
... AND val45>463;
I think it's not possible like that, but can somebody explain why?
Right now I do this filtering in my code, but are there other possibilities?
I hope I did not miss a post that already explains this.
Thank you for your help!
Cassandra range queries are only possible on the last clustering column restricted by the query. So, if your pk is (a,b,c,d), you can do
... where a=2 and b=4 and c>5
... where a=2 and b>4
but not
... where a=2 and c>5
This is because data is stored in partitions, indexed by the partition key (the first key of the pk), and then sorted by each successive clustering key.
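The rule above can be written down as a small check. This Python sketch is a simplified model of the rule as stated here, not Cassandra's actual query validation; `can_serve` is a hypothetical name. It assumes the partition key (a) is already restricted by equality and looks only at the clustering columns (b, c, d).

```python
def can_serve(clustering, eq_cols, range_col=None):
    """Check a restriction set against the clustering column order.

    clustering: clustering columns in declared order, e.g. ["b", "c", "d"]
    eq_cols:    columns restricted with '='
    range_col:  optional column restricted with '<' or '>'
    """
    # Equality restrictions must cover a leading prefix of the clustering
    # columns, with no gaps.
    prefix = clustering[:len(eq_cols)]
    if set(prefix) != set(eq_cols):
        return False
    if range_col is None:
        return True
    # The range must sit on the column right after that prefix, so the
    # matching rows form one contiguous slice of the sorted partition.
    return len(eq_cols) < len(clustering) and clustering[len(eq_cols)] == range_col
```

With clustering columns (b, c, d), the three examples map to can_serve(cols, ["b"], "c"), can_serve(cols, [], "b") and can_serve(cols, [], "c"); only the last is rejected, because skipping b means the rows with c>5 are scattered across the partition.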
If you have exact values, you can add a secondary index on val4 and then do
... and val4=34
but that's about it. And even then, you want to hit a partition before applying the index. Otherwise you'll get a cluster-wide query that will likely time out.
The querying limitations exist because of the way Cassandra stores data for fast insert and retrieval. All data in a partition is held together, so filtering within the partition on the client side is usually not a problem, unless you have very large wide rows (in which case, perhaps the schema should be reviewed).
Hope that helps.
I am trying to figure out what would be the best way to implement a valid from/to data filtering in Cassandra.
I need to have a table with records that are only valid in a certain time window, which is always defined. Each such record would not be valid for more than, let's say, 3 months.
I would like to have a structure like this (more or less, of course):
userId bigint,
validFrom timestamp ( or maybe split into columns like: from_year, from_month etc. if that helps )
validTo timestamp ( or as above )
someCollection set
All queries would be performed by userId, validFrom, validTo.
I know the limits of querying in Cassandra (both PK and clustering keys) but maybe I am missing some trick or clever usage of what is available in CQL.
Any help appreciated!
You could select by validFrom only and set a TTL on each record based on its validTo, so that the number of records you need to filter in your app doesn't get too large. However, depending on how many records you have per user, this may result in a lot of tombstones.
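A minimal sketch of the application side of this approach, in Python. It assumes records arrive as dictionaries with validFrom and validTo datetime values, and that both ends of the window are inclusive; the helper names are hypothetical. Note that in CQL a TTL of 0 means "no expiry", so a record whose window has already closed should be skipped rather than written with TTL 0.

```python
import math
from datetime import datetime, timedelta


def ttl_seconds(valid_to, now):
    """Seconds until valid_to, for use as the write TTL.

    Returns None when the window has already closed, because TTL 0 would
    mean the record never expires.
    """
    remaining = (valid_to - now).total_seconds()
    if remaining <= 0:
        return None
    return math.ceil(remaining)


def valid_at(records, at):
    """Client-side filter: keep records whose window contains `at`."""
    return [r for r in records if r["validFrom"] <= at <= r["validTo"]]
```

The query itself would restrict only userId and validFrom; valid_at then drops the few rows that started in time but are no longer valid and have not yet been expired by their TTL.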
I have to create a table which stores a big amount of data (like 400 columns and 5.000.000 to 40.000.000 rows). There is a counter "counter" which counts from 1 upwards. Right now this is my primary key. The other variables are int, float, and varchar type and repeating.
I need to do this for a database-comparison, so I have to use Cassandra, even if there could be other databases, that can do better in this specific problem.
On this table I want to execute some range queries. The queries should be like:
SELECT counter, val1, val2, val3 FROM table WHERE counter > 1000 AND counter < 5000;
Also there will be other filter-parameters:
... AND val54 = 'OK';
I think this is a problem in Cassandra, because "counter" is the PK. I will try running the token() function, but I guess this will be slow.
Right now I am learning about the data modelling in Cassandra but I hope somebody with experience in Cassandra got some hints for me, like how to organize the table and make the queries possible and fast? Perhaps just some topics I should learn about or links that will help me.
Have a nice day,
Friedrich
This sounds like a bad use case for Cassandra.
First, range queries on the partition key are discouraged in Cassandra. This is because such a range can't be resolved without visiting every node in the cluster.
Second, you can't mix a counter type column with other column types. For a given table it can either have (and only have) counter columns or it can have all non-counter columns.
As far as Cassandra data modeling goes, if you want to create a successful data model, create your partitions around the exact thing you're going to query.
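One way to read "create your partitions around the thing you're going to query" for this case is to bucket the running number, so that a range maps to a small, known set of partitions. The Python sketch below is an illustration under assumptions that are not in the question: the number is an integer, the partition key is number // bucket_size, and the number itself becomes a clustering column. `buckets_for_range` is a hypothetical helper.

```python
def buckets_for_range(low, high, bucket_size):
    """Partition buckets that can hold values with low < value < high.

    Both bounds are exclusive, matching 'counter > low AND counter < high'.
    """
    if high - low < 2:
        return []  # no integer lies strictly between the bounds
    first = (low + 1) // bucket_size
    last = (high - 1) // bucket_size
    return list(range(first, last + 1))
```

For counter > 1000 AND counter < 5000 with a bucket size of 1000, this gives buckets 1 to 4, and the application would issue one range query per bucket instead of a scan over the whole cluster.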