I have an existing data table in Cassandra where the primary key is id, and I'm trying to run this query:
SELECT * FROM Op_History ORDER BY create_time DESC limit 100;
I tried this, but I get the following error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="ORDER BY is only supported when the partition key is restricted by an EQ or an IN."
The primary key is just id.
So with Cassandra, you need to design your tables to support a specific query. With the PRIMARY KEY being id, the only query the table really supports is retrieving each individual row by its id.
I'd recommend building a query table for that data like this:
CREATE TABLE op_history (
    id UUID,
    create_time TIMESTAMP,
    day_bucket INT,
    op_data TEXT,
    PRIMARY KEY ((day_bucket), create_time, id))
WITH CLUSTERING ORDER BY (create_time DESC, id ASC);
By partitioning on day_bucket, I'm ensuring that all data for a specific day is stored together. I'm not sure about the business case behind op_history, but if you need to query for an entire month's worth of data, then you would use something like month_bucket instead.
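For reference, the sample rows shown below were written with plain INSERTs. Here's a minimal sketch using the Python driver (the contact point, keyspace name, and op_data value are just placeholders):

import uuid
from datetime import datetime, timezone
from cassandra.cluster import Cluster

# Placeholder contact point and keyspace.
session = Cluster(['127.0.0.1']).connect('stackoverflow')

insert = session.prepare(
    "INSERT INTO op_history (day_bucket, create_time, id, op_data) "
    "VALUES (?, ?, ?, ?)")
session.execute(insert, (20221221, datetime.now(timezone.utc), uuid.uuid4(), 'Hello!'))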
Now, I can filter on rows for a specific day:
> SELECT * FROM op_history WHERE day_bucket=20221221;
 day_bucket | create_time                     | id                                   | op_data
------------+---------------------------------+--------------------------------------+---------
   20221221 | 2022-12-21 14:42:58.552000+0000 | 59b0a30b-213b-4847-bd3e-134a641be21f | Hello4!
   20221221 | 2022-12-21 14:42:56.057000+0000 | 7148d5b3-77d7-4088-8c6d-f2e4c73175f2 | Hello3!
   20221221 | 2022-12-21 14:42:53.866000+0000 | b23f4556-2a72-4014-a6e9-7a2ceb55217c | Hello2!
   20221221 | 2022-12-21 14:42:47.738000+0000 | 51d09afa-806e-4bec-b6bf-94eb1a67910d | Hello!
(4 rows)
With the CLUSTERING ORDER defined, I won't need an ORDER BY clause.
Unfortunately, I don't have the chance to change the table creation.
Oh, I'm not suggesting that. I'm suggesting that you create a new table with a different primary key definition, and load the same data into it. That's actually the best practice for data modeling in Cassandra.
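If it helps, a one-off backfill from the old table into the new one can be as simple as this sketch with the Python driver (I'm assuming the old table is named op_history_old with columns id, create_time, and op_data; adjust the names to your schema):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('stackoverflow')  # placeholder contact point/keyspace

insert = session.prepare(
    "INSERT INTO op_history (day_bucket, create_time, id, op_data) "
    "VALUES (?, ?, ?, ?)")

# A full-table read is fine for a one-off migration, but not for serving queries.
for row in session.execute("SELECT id, create_time, op_data FROM op_history_old"):
    day_bucket = int(row.create_time.strftime('%Y%m%d'))  # derive the day bucket
    session.execute(insert, (day_bucket, row.create_time, row.id, row.op_data))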
Are there any other possibilities, e.g. ALLOW FILTERING?
So using the ALLOW FILTERING directive is generally considered to be "bad practice," because it consumes too many resources. If the query has to talk to too many nodes, it could likely time out or even crash the coordinator node. Also, ALLOW FILTERING still won't allow an ORDER BY to be applied.
One thing a lot of teams end up doing is building a Spark cluster to work with Cassandra data. Spark can pull data from Cassandra and work with it in RAM to perform ANSI-compliant SQL operations on it. That would allow you to apply an ORDER BY.
On the other hand, you could try ALLOW FILTERING and then perform the sort on the application side. Definitely not ideal.
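For completeness, here's what that application-side sort might look like with the Python driver (placeholder connection details; note that the unrestricted SELECT is a full scan):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('stackoverflow')  # placeholder

# Pull the rows back and sort on the client; expensive on a large table.
rows = list(session.execute("SELECT id, create_time, op_data FROM op_history"))
latest_100 = sorted(rows, key=lambda r: r.create_time, reverse=True)[:100]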
Related
Title says all. I have a table timestampTEST
create table timestampTEST ( timestamp timestamp, test text, PRIMARY KEY(timestamp));
When trying to
select * from timestampTEST where timestamp > '2021-01-03' and timestamp < '2021-01-04';
I got this error:
InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"
What I saw here https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/refTimeUuidFunctions.html is this sample (though I assume it is just part of the CQL query):
SELECT * FROM myTable
WHERE t > maxTimeuuid('2013-01-01 00:05+0000')
AND t < minTimeuuid('2013-02-02 10:00+0000')
I know the above is related to timeuuid, but I have tried it as well and it yields the same error.
It's not possible to do in CQL without ALLOW FILTERING. The primary reason is that in your table the primary key is the same as the partition key, so to fulfill your query Cassandra would need to scan data on all servers. This happens because the partition key is not ordered: the value is hashed and used to select the server on which it will be stored. So CurrentTime-1sec will be on one server, CurrentTime-10sec on another, etc.
Usually, for such queries people use external tools such as DSBulk, or Spark with the Spark Cassandra Connector. You can refer to the following answers that I have already provided on this topic (and see the sketch after these links):
Data model in Cassandra and proper deletion Strategy
Delete records in Cassandra table based on time range
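If redesigning the table is an option, the usual workaround is to add a time bucket to the partition key so that a day's rows live in one partition; here's a minimal sketch with the Python driver (the table name and day-sized buckets are my assumptions):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('stackoverflow')  # placeholder

session.execute("""
    CREATE TABLE IF NOT EXISTS messages_by_day (
        day date,
        ts timestamp,
        test text,
        PRIMARY KEY ((day), ts))""")

# With the partition key restricted, the timestamp range is an ordinary
# clustering-key slice and needs no ALLOW FILTERING.
rows = session.execute(
    "SELECT * FROM messages_by_day WHERE day = '2021-01-03' "
    "AND ts >= '2021-01-03' AND ts < '2021-01-04'")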
I am selecting from a Cassandra database using the LIKE operator on a non-primary-key column.
select * from "TABLE_NAME" where "Column_name" LIKE '%SpO%' ALLOW FILTERING;
Error from server: code=2200 [Invalid query] message="LIKE restriction is only
supported on properly indexed columns. parameter LIKE '%SpO%' is not valid."
Simply put, "yes" there is a way to query with LIKE on a non-primary-key component. You can do this with a SASI (SSTable Attached Secondary Index) index. Here is a quick example:
CREATE TABLE testLike (key TEXT PRIMARY KEY, value TEXT) ;
CREATE CUSTOM INDEX valueIdx ON testLike (value)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS={'mode':'CONTAINS'};
As your query requires matching a string within a column, and not just a prefix or suffix, you'll want to pass the CONTAINS option on index creation.
After writing some data, your query works for me:
> SELECT * FROM testlike WHERE value LIKE '%SpO%';
 key | value
-----+--------------
   C | CSpOblahblah
   D | DSpOblahblah
(2 rows)
WARNING!!!
This query is extremely inefficient, and will probably time out in a large cluster, unless you also filter by a partition key in your WHERE clause. It's important to understand that while this functionality works similarly to how a relational database would, Cassandra is definitely not a relational database. It is simply not designed to handle queries which incur a large amount of network time polling multiple nodes for data.
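For example, if the partition key can also be restricted, the LIKE only has to scan a single partition; a minimal sketch with the Python driver (placeholder connection details):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('stackoverflow')  # placeholder

# Restricting the partition key keeps the SASI LIKE on one partition.
rows = session.execute("SELECT * FROM testLike WHERE key = 'C' AND value LIKE '%SpO%'")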
I'm coming from an Azure DocumentDB (Cosmos DB) background to AWS DynamoDB, for an application where DynamoDB is already being used.
I am confused about the partition key in DynamoDB.
As I understand it, the partition key is used to segregate the data into different partitions as it grows; however, many suggest using the primary key as the partition key, such as User Id, Customer Id, or Order Id. In that case I am not sure how we achieve better performance, since we have many partitions, so a query may need to be executed on multiple servers.
For example, if I wanted to develop a multi-tenant system where I use a single table to store all tenants' data, partitioned by tenant id, I would do the following in DocumentDB.
1) Storing data
Create objects with the following schema.
Primary key: Order Id
Partition key: Tenant id
2) Retrieving all records for a tenant
SELECT * FROM Orders o WHERE o.tenantId="tenantId"
3) Retrieving a record by id for a tenant
SELECT * FROM Orders o WHERE o.Id='id' and o.tenantId="tenantId"
4) Retrieving all records for a tenant with sorting
SELECT * FROM Orders o WHERE o.tenantId="tenantId" order by o.CreatedData
// by default all fields in DocumentDB are indexed, so ORDER BY just works
How do I achieve same operations in dynamo db?
Finally, I have figured out how to use DynamoDB properly. Thanks to Jesse Carter, whose comment was very helpful in understanding DynamoDB better. I am answering my own question now.
Compared to other NoSQL databases, DynamoDB is a bit difficult because the terminology is so confusing. Below I have outlined simplified DynamoDB table designs for a few common scenarios.
Primary key
In DynamoDB, primary keys do not need to be unique. I understand this is very confusing compared to all the other products out there, but it is the fact. Primary keys (in DynamoDB) are actually "partition keys".
Finding 1
You are always required to supply the primary key as part of a query.
Scenario 1 - Key value(s) store
Assume you want to create a table with Id and multiple other attributes, and you query based on the Id attribute only. In this case, Id could be the primary key.
|---------------------|------------------|
| User Id | Name |
|---------------------|------------------|
| 12 | value1 |
| 13 | value2 |
|---------------------|------------------|
We can have User Id as "Primary Key (Partition Key)"
Scenario 2
Now say we want to store messages for users as shown below, and we will query by user id to retrieve all messages for a user.
|---------------------|------------------|
| User Id | Message Id |
|---------------------|------------------|
| 12 | M1 |
| 12 | M2 |
| 13 | M3 |
|---------------------|------------------|
Still "User Id" shall be a primary key for this table. Please remember Primary key in dynamodb not need to be unique per document. Message Id can be Sort key
So what is Sort key.
Sort key is a kind of unique key within a partition. Combination of Partition key, and Sort key has to be unique.
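A minimal boto3 sketch of scenario 2's table, with User Id as the partition (hash) key and Message Id as the sort (range) key (the table name and capacity values are placeholders):

import boto3

dynamodb = boto3.client('dynamodb')

dynamodb.create_table(
    TableName='Messages',  # placeholder name
    KeySchema=[
        {'AttributeName': 'UserId', 'KeyType': 'HASH'},     # partition key
        {'AttributeName': 'MessageId', 'KeyType': 'RANGE'}  # sort key
    ],
    AttributeDefinitions=[
        {'AttributeName': 'UserId', 'AttributeType': 'S'},
        {'AttributeName': 'MessageId', 'AttributeType': 'S'}
    ],
    ProvisionedThroughput={'ReadCapacityUnits': 5, 'WriteCapacityUnits': 5})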
Creating the table locally
If you are using Visual Studio, you can install the AWS Toolkit for Visual Studio to create local tables on your machine for testing.
Note: the toolkit's table-creation UI introduces some more terms: hash key and range key. Always a surprise with DynamoDB, isn't it? :) Actually:
(Primary Key = Partition Key = Hash Key) != your application object's primary key
As per our second scenario, "Message Id" is supposed to be the primary key for our application; however, in DynamoDB terms "User Id" became the primary key, to gain the partitioning benefits.
(Sort Key = Range Key) = could be your application object's primary key
Local Secondary Indexes
We can create indexes within a partition; these are called local secondary indexes. For example, suppose we want to retrieve messages for a user based on message status (a query sketch follows the table below):
|------------|--------------|------------|
| User Id | Message Id | Status |
|------------|--------------|------------|
| 12 | M1 | 1 |
| 12 | M2 | 0 |
| 13 | M3 | 2 |
|------------|--------------|------------|
Primary Key: User Id
Sort Key: Message Id
Secondary Local Index: Status
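Querying that local secondary index with boto3 might look like this (the index name StatusIndex is my placeholder; remember that LSIs have to be declared when the table is created):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('Messages')  # placeholder name

# Same partition key as before, but keyed on Status through the LSI.
response = table.query(
    IndexName='StatusIndex',
    KeyConditionExpression=Key('UserId').eq('12') & Key('Status').eq(1))
items = response['Items']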
Global Secondary Indexes
As the name states, it is a global index. If we want to retrieve a single message by its id, without the partition key (i.e. user id), then we create a global secondary index on Message Id.
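With such an index in place, a boto3 query by Message Id alone could look like this (the index name is a placeholder):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('Messages')  # placeholder name

# No User Id needed: the GSI uses Message Id as its own partition key.
response = table.query(
    IndexName='MessageIdIndex',
    KeyConditionExpression=Key('MessageId').eq('M1'))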
Please see the explanation from the AWS documentation:
The primary key uniquely identifies each item in a table. The primary key can be simple (partition key) or composite (partition key and sort key).
When it stores data, DynamoDB divides a table's items into multiple partitions, and distributes the data primarily based upon the partition key value. Consequently, to achieve the full amount of request throughput you have provisioned for a table, keep your workload spread evenly across the partition key values. Distributing requests across partition key values distributes the requests across partitions.
For example, if a table has a very small number of heavily accessed partition key values, possibly even a single very heavily used partition key value, request traffic is concentrated on a small number of partitions – potentially only one partition. If the workload is heavily unbalanced, meaning that it is disproportionately focused on one or a few partitions, the requests will not achieve the overall provisioned throughput level. To get the most out of DynamoDB throughput, create tables where the partition key has a large number of distinct values, and values are requested fairly uniformly, as randomly as possible.
This does not mean that you must access all of the partition key values to achieve your throughput level; nor does it mean that the percentage of accessed partition key values needs to be high. However, be aware that when your workload accesses more distinct partition key values, those requests will be spread out across the partitioned space in a manner that better utilizes your allocated throughput level. In general, you will utilize your throughput more efficiently as the ratio of partition key values accessed to the total number of partition key values in a table grows.
As I understand it, the partition key is used to segregate the data into different partitions as it grows; however, many suggest using the primary key as the partition key, such as User Id, Customer Id, or Order Id. In that case I am not sure how we achieve better performance, as we have many partitions.
You are correct that the partition key is used in DynamoDB to segregate data into different partitions. However, the partition key and the physical partition in which an item resides do not have a one-to-one mapping.
The number of partitions is decided based on your RCU/WCU in such a way that all provisioned RCU/WCU can be utilized.
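As a rough illustration, older AWS docs described the partition count with a rule of thumb along these lines (an approximation only; adaptive capacity has since changed the internals):

import math

# Rule of thumb from older AWS docs: one partition sustains roughly
# 3,000 RCU or 1,000 WCU and holds about 10 GB.
rcu, wcu, size_gb = 6000, 2000, 25
by_throughput = math.ceil(rcu / 3000 + wcu / 1000)  # 2 + 2 -> 4
by_size = math.ceil(size_gb / 10)                   # -> 3
partitions = max(by_throughput, by_size)            # -> 4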
In DynamoDB, primary keys do not need to be unique. I understand this is very confusing compared to all the other products out there, but it is the fact. Primary keys (in DynamoDB) are actually "partition keys".
This is a wrong understanding. The concept of the primary key is exactly the same as in the SQL standard, with the extra restrictions you would expect a NoSQL database to have. In short, you can have a partition key as the primary key, or a partition key plus a sort key as a composite primary key.
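In boto3 terms, the two legal primary-key shapes look like this (attribute names are placeholders):

# Simple primary key: partition key only.
simple_key = [{'AttributeName': 'OrderId', 'KeyType': 'HASH'}]

# Composite primary key: partition key plus sort key.
composite_key = [
    {'AttributeName': 'TenantId', 'KeyType': 'HASH'},
    {'AttributeName': 'OrderId', 'KeyType': 'RANGE'},
]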
I caveat this question by stating: I am somewhat new to NoSQL and very new to Cassandra, but it seems like it might be a good fit for what I'm trying to do.
Say I have a list of sensors giving input at reasonable intervals. My proposed data model is to partition by the name of the sensor, where it is (area), and the date (written as yyyyMMdd), and to cluster the readings for that day by the actual time the reading occurred. The thinking is that the query for "get all readings from sensor A on date B" should be extremely quick. So far so good, I think. The table / CF looks like this in CQL:
CREATE TABLE data (
    area_id int,
    sensor varchar,
    date ascii,
    event_time timeuuid,
    PRIMARY KEY ((area_id, sensor, date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
This doesn't, however, actually include any data, and I'm not sure how to add that to the model. Each reading (from the same sensor) can have a different set of arbitrary data, and I won't know ahead of time what this will be. E.g. I could get temperature data, I could get humidity, I could get both, or I could get something I haven't seen before. It's up to the person who actually recorded the data as to what they want to submit (it's not read from automated sensors).
Given that I want to do query operations on this data (which is basically UGC), what are my options? Queries will normally consist of counts on the data (e.g. count readings from sensor A on date B where some_ugc_valueX = C and some_ugc_valueY = D). It is worth noting that there will be more data points than would normally be queried at once. A reading could have 20 data values, but maybe only 2 or 3 would be queried; it's just unknown which ones ahead of time.
Currently I have thought of:
Store the data for each sensor reading as a Map type. This would certainly make the model simple, but my understanding is that querying would then be difficult? I think I would need to pull the entire map back for each sensor reading, then check the values and count them outside of Cassandra in Storm/Hadoop/whatever.
Store each of the user values as another column (composite column with the event_time uuid). This would mean not using CQL, as that doesn't support adding arbitrary new columns at insert time. The Thrift API does, however, allow this. This means I can get Cassandra to do the counting itself.
Maybe I'm going about this the wrong way? Maybe Cassandra isn't even the best choice for this kind of data?
tl;dr: you can't choose both speed and absolute flexibility ;-)
Queries based on data from user-generated content are going to be complex: you aren't going to be able to produce a one-size-fits-all table definition that allows quick responses for queries based on UGC content. Even if you choose to use Maps, Cassandra has to deserialize the entire data structure on every query, so it's not really an option for big Maps, which, as you suggest in your question, is likely to be the case.
An alternative might be to store the sensor data in a serialized form, e.g. JSON. This gives maximum flexibility in what is being stored, at the expense of being unable to make complex queries. The serialization/deserialization burden is pushed to the client, and all data is sent over the wire. Here's a simple example:
Table creation (slightly simpler than your example - I've dropped date):
create table data (
    area_id int,
    sensor varchar,
    event_time timeuuid,
    data varchar,
    primary key (area_id, sensor, event_time)
);
Insertion:
insert into data(area_id,sensor,event_time,data) VALUES (1,'sensor1',now(),'{"datapoint1":"value1"}');
insert into data(area_id,sensor,event_time,data) VALUES (1,'sensor2',now(),'{"datapoint1":"value1","count":"7"}');
Querying by area_id and sensor:
> select area_id,sensor,dateof(event_time),data from data where area_id=1 and sensor='sensor1';
 area_id | sensor  | dateof(event_time)       | data
---------+---------+--------------------------+-------------------------
       1 | sensor1 | 2013-11-06 17:37:02+0000 | {"datapoint1":"value1"}
(1 rows)
Querying by area_id:
> select area_id,sensor,dateof(event_time),data from data where area_id=1;
 area_id | sensor  | dateof(event_time)       | data
---------+---------+--------------------------+-------------------------------------
       1 | sensor1 | 2013-11-06 17:37:02+0000 | {"datapoint1":"value1"}
       1 | sensor2 | 2013-11-06 17:40:49+0000 | {"datapoint1":"value1","count":"7"}
(2 rows)
(Tested using [cqlsh 4.0.1 | Cassandra 2.0.1 | CQL spec 3.1.1 | Thrift protocol 19.37.0].)
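Since the deserialization burden sits with the client, a count such as "readings where some UGC field equals a value" becomes a client-side filter; here is a minimal sketch with the Python driver (placeholder connection details):

import json
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('stackoverflow')  # placeholder

rows = session.execute(
    "SELECT data FROM data WHERE area_id = 1 AND sensor = 'sensor1'")

# Deserialize each reading and count the matches on the client.
count = sum(1 for row in rows if json.loads(row.data).get('datapoint1') == 'value1')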
One main part of Cassandra that I don't fully understand is its range queries. I know that Cassandra emphasizes distributed environments and focuses on performance, but probably because of that, it currently only supports the few types of range queries that it can execute efficiently. What I would like to know is: which types of range queries are supported by Cassandra?
As far as I know, Cassandra supports the following range queries:
1: Range Queries on Primary key with keyword TOKEN, for example:
CREATE TABLE only_int (int_key int PRIMARY KEY);
...
select * from only_int where token(int_key) > 500;
2: Range Queries with one equality condition on a secondary index with keyword ALLOW FILTERING, for example:
CREATE TABLE example (
    int_key int PRIMARY KEY,
    int_non_key int,
    str_2nd_idx ascii
);
CREATE INDEX so_example_str_2nd_idx ON example (str_2nd_idx);
...
select * from example where str_2nd_idx = 'hello' and int_non_key < 5 allow filtering;
But I am wondering if I am missing something, and I am looking for a canonical answer which lists all types of range queries supported by current CQL (or some workaround that allows more types of range queries).
You can look at clustering keys.
A primary key is formed by a partition key and then by clustering keys.
For example, a definition like this one
CREATE TABLE example (
    int_key int,
    int_non_key int,
    str_2nd_idx ascii,
    PRIMARY KEY ((int_key), str_2nd_idx)
);
will allow you to make queries like this without using token:
select * from example where str_2nd_idx < 'hello' allow filtering;
Before creating a table in Cassandra, you should start by thinking about your queries and what you want to ask of your data model.
Apart from the queries you mentioned, you can also have range queries on "composite key" column families (you need to design your DB using composite keys, if that fits your constraints). For an example/discussion on this, take a look at Query using composite keys, other than Row Key in Cassandra. When using composite keys you can perform other types of queries, namely "range" queries that do not use the "partition key" (the first element of the composite key); normally you need to set the "allow filtering" parameter to allow these queries. You can also perform "order by" operations on those elements, which can be very interesting in many situations. I do think that composite-key column families let you overcome several (necessary) "limitations" (there to guarantee performance) of the Cassandra data model, compared with the "extremely flexible" (but slow) model of an RDBMS.
1) Create the table:
create table test3 (name text, id int, address text, num int, primary key(name, id, address)) with compact storage;
2) Insert data into the table:
insert into test3 (name, id, address, num) values ('prasad', 1, 'bangalore', 1);
insert into test3 (name, id, address, num) values ('prasad', 2, 'bangalore', 2);
insert into test3 (name, id, address, num) values ('prasad', 3, 'bangalore', 3);
insert into test3 (name, id, address, num) values ('prasad', 4, 'bangalore', 4);
3) Query, with a range on the clustering column id:
select * from test3 where name='prasad' and id <3;
4) Result:

 name   | id | address   | num
--------+----+-----------+-----
 prasad |  1 | bangalore |   1
 prasad |  2 | bangalore |   2

(2 rows)