Query Cassandra with Both Primary Key and Secondary Key Constraints - cassandra

I have a table in Cassandra defined as
CREATE TABLE foo ("A" text, "B" text, "C" text,
"D" text, "E" text, "F" text,
PRIMARY KEY ("A", "B"));
CREATE INDEX ON foo ("C");
I inserted billions of records into this table. And now I want to query the table with CQL
SELECT * FROM foo WHERE "A"='abc' AND "B"='def' AND "C"='ghi'
I keep receiving error 1200, which says:
ReadTimeout: code=1200 [Coordinator node timed out waiting for replica
nodes' responses] message="Operation timed out - received only 0
responses." info={'received_responses': 0, 'required_responses': 1,
'consistency': 'ONE'}
After googling, I suspect the cause of this error is that the query is directed to some partitions that do not hold any data.
My questions are
Is there any constraint querying CQL with both primary key and secondary key specified?
If I specified the partition key in my CQL, here "A"='abc' (correct me if wrong), why C* still tries other partition that apparently does not hold the data?
Any hints to solve this timeout problem?
Thank you!

Note: For my examples, I got rid of the double-quotes around the column names. It really doesn't do anything other than preserve case in the column names (not the values) and only just serves to muck-up the works.
Is there any constraint querying CQL with both primary key and secondary key specified?
First of all, I need to clear up what, exactly, your "primary key" and "secondary key" are. If you are referring to C as a "secondary key," then yes, you can, with some restrictions. If you mean your partition key (A) and your clustering key (B), then yes, you can.
Querying by your partition and clustering keys (or even just your partition key) works:
aploetz@cqlsh:stackoverflow2> SELECT * FROM foo WHERE A='abc' AND B='def';
a | b | c | d | e | f
-----+-----+-----+-----+-----+-----
abc | def | ghi | jkl | mno | pqr
(1 rows)
aploetz@cqlsh:stackoverflow2> SELECT * FROM foo WHERE A='abc';
a | b | c | d | e | f
-----+-----+-----+-----+-----+-----
abc | ddd | ghi | jkl | mno | pqr
abc | def | ghi | jkl | mno | pqr
(2 rows)
When I create your table and index, insert a few rows, and run your query:
aploetz@cqlsh:stackoverflow2> SELECT * FROM foo WHERE A='abc' AND B='def' AND C='ghi';
a | b | c | d | e | f
-----+-----+-----+-----+-----+-----
abc | def | ghi | jkl | mno | pqr
(1 rows)
That works.
If I specified the partition key in my CQL, here "A"='abc' (correct me if wrong), why C* still tries other partition that apparently does not hold the data?
I don't believe that is the problem. You are restricting it to a single partition, so it should only query data off of the abc partition.
I inserted billions of records into this table.
What you are seeing is the reason that secondary index usage is considered an "anti-pattern" in Cassandra. Secondary indexes do not work the same way that they do in the relational world, and they do not scale well to large clusters or data sets.
Any hints to solve this timeout problem?
Yes. Recreate your table with C as a second clustering key. And do not create an index on C.
CREATE TABLE foo (A text, B text, C text, D text, E text, F text,
PRIMARY KEY (A, B, C));
Reload your data, and then this should work for you:
aploetz@cqlsh:stackoverflow2> SELECT * FROM foo WHERE A='abc' AND B='def' AND C='ghi';
Not only should it work, but it should not time out, and it should be fast.

Related

Get the last 100 rows from cassandra table

I have a table in Cassandra, and I cannot select the last 200 rows in it.
The clustering order by clause was supposed to enforce sorting on disk.
CREATE TABLE t1(id int ,
event text,
receivetime timestamp ,
PRIMARY KEY (event, id)
) WITH CLUSTERING ORDER BY (id DESC)
;
The output is unsorted by id:
event | id | receivetime
---------+----+---------------------------------
event1 | 1 | 2021-07-12 08:11:57.702000+0000
event7 | 7 | 2021-05-22 05:30:00.000000+0000
event5 | 5 | 2021-05-25 05:30:00.000000+0000
event9 | 9 | 2021-05-22 05:30:00.000000+0000
event2 | 2 | 2021-05-21 05:30:00.000000+0000
event10 | 10 | 2021-05-23 05:30:00.000000+0000
event4 | 4 | 2021-05-24 05:30:00.000000+0000
event6 | 6 | 2021-05-27 05:30:00.000000+0000
event3 | 3 | 2021-05-22 05:30:00.000000+0000
event8 | 8 | 2021-05-21 05:30:00.000000+0000
How do I overcome this problem?
Thanks
The same question was asked on https://community.datastax.com/questions/11983/ so I'm re-posting my answer here.
The rows within a partition are sorted based on the order of the clustering column, not the partition key.
In your case, the table's primary key is defined as:
PRIMARY KEY (event, id)
This means that each partition key can have one or more rows, with each row identified by the id column. Since there is only one row in each partition, the sorting order is not evident. But if you had multiple rows in each partition, you'd be able to see that they would be sorted. For example:
event | id | receivetime
---------+----+---------------------------------
event1 | 7 | 2021-05-22 05:30:00.000000+0000
event1 | 5 | 2021-05-25 05:30:00.000000+0000
event1 | 1 | 2021-07-12 08:11:57.702000+0000
In the example above, the partition event1 has 3 rows sorted by the ID column in reverse order.
In addition, running unbounded queries (no WHERE clause filter) is an anti-pattern in Cassandra because it requires a full table scan. If you consider a cluster which has 500 nodes, an unbounded query has to request all the partitions (records) from all 500 nodes to return the result. It will not perform well and does not scale. Cheers!
The clustering order is the order within a single partition key value; e.g., all of the rows for event1 would be in order within event1. It is not a global ordering.
From your results we can see you are selecting multiple partitions - which is why you are not seeing an order you expect.
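To make the "per partition, not global" point concrete, here is a minimal Python sketch of how an unbounded SELECT walks the table. It is only a model: the `token` function uses MD5 as a stand-in for the partitioner's hash (an assumption, not what Cassandra actually computes), and `full_scan` is a hypothetical helper name.

```python
import hashlib
from collections import defaultdict


def token(partition_key: str) -> int:
    # Stand-in for the partitioner's hash (assumption: MD5 is used here only
    # to get a stable pseudo-random ordering of partition keys).
    return int.from_bytes(hashlib.md5(partition_key.encode()).digest()[:8], "big")


def full_scan(rows):
    """Model of SELECT * without a WHERE clause: partitions come back in
    token order, and rows inside each partition come back by id DESC."""
    partitions = defaultdict(list)
    for event, row_id in rows:
        partitions[event].append(row_id)
    result = []
    for event in sorted(partitions, key=token):
        for row_id in sorted(partitions[event], reverse=True):
            result.append((event, row_id))
    return result


rows = [("event1", 1), ("event1", 7), ("event1", 5), ("event2", 2), ("event3", 3)]
scan = full_scan(rows)
event1_ids = [row_id for event, row_id in scan if event == "event1"]

print(scan)        # partitions appear in hash order, not alphabetical order
print(event1_ids)  # [7, 5, 1]
```

The ids are only guaranteed to be descending inside one partition; the order in which the partitions themselves appear depends on the hash.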

cassandra composite index and compact storages

I am new to Cassandra and have not run it yet, but my business logic requires a table like this:
CREATE TABLE Index(
user_id uuid,
keyword text,
score text,
fID int,
PRIMARY KEY (user_id, keyword, score); )
WITH CLUSTERING ORDER BY (score DESC) and COMPACT STORAGE;
Is it possible or not? I have only one column (fID) which is not part of my composite index, so I hope I will be able to apply the compact storage setting. Pay attention that I ordered by the third column of my composite index, not the second. I need to compact the storage as well, so the keywords will not be repeated for each fID.
A few things initially about your CREATE TABLE statement:
It will error on the semicolon (;) after your PRIMARY KEY definition.
You will need to pick a new name, as Index is a reserved word.
Pay attention that I ordered by the third column of my composite index, not the second.
You cannot skip a clustering key when you specify CLUSTERING ORDER.
However, I do see an option here. Depending on your query requirements, you could simply re-order keyword and score in your PRIMARY KEY definition, and then it would work:
CREATE TABLE giveMeABetterName(
user_id uuid,
keyword text,
score text,
fID int,
PRIMARY KEY (user_id, score, keyword)
) WITH CLUSTERING ORDER BY (score DESC) and COMPACT STORAGE;
That way, you could query by user_id and your rows (keywords?) for that user would be ordered by score:
SELECT * FROM giveMeABetterName WHERE user_id=1b325b66-8ae5-4a2e-a33d-ee9b5ad464b4;
If that won't work for your business logic, then you might have to retouch your data model. But it is not possible to skip a clustering key when specifying CLUSTERING ORDER.
Edit
But re-ordering of columns does not work for me. Can I do something like this: WITH CLUSTERING ORDER BY (keyword asc, score desc)?
Let's look at some options here. I created a table with your original PRIMARY KEY, but with this CLUSTERING ORDER. That will technically work, but look at how it treats my sample data (video game keywords):
aploetz@cqlsh:stackoverflow> SELECT * FROM givemeabettername WHERE user_id=dbeddd12-40c9-4f84-8c41-162dfb93a69f;
user_id | keyword | score | fid
--------------------------------------+------------------+-------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Assassin's creed | 87 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Battlefield 4 | 9 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | Uncharted 2 | 91 | 0
(3 rows)
On the other hand, if I alter the PRIMARY KEY to cluster on score first (and adjust CLUSTERING ORDER accordingly), the same query returns this:
user_id | score | keyword | fid
--------------------------------------+-------+------------------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 91 | Uncharted 2 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 87 | Assassin's creed | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 9 | Battlefield 4 | 0
Note that you'll want to change the data type of score from TEXT to a numeric type (int/bigint) to avoid ASCII-betical sorting. With score left as TEXT, the rows come back like this:
user_id | score | keyword | fid
--------------------------------------+-------+------------------+-----
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 91 | Uncharted 2 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 9 | Battlefield 4 | 0
dbeddd12-40c9-4f84-8c41-162dfb93a69f | 87 | Assassin's creed | 0
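The difference between the two orderings can be reproduced outside Cassandra. This is a small Python sketch, using the three scores from the sample data, that compares a descending sort on strings with a descending sort on integers. Python compares strings character by character, which matches text ordering for plain ASCII digits.

```python
scores_as_text = ["87", "9", "91"]
scores_as_int = [int(s) for s in scores_as_text]

# Text comparison goes character by character, so "9" sorts above "87".
text_desc = sorted(scores_as_text, reverse=True)
# Numeric comparison uses the value itself.
int_desc = sorted(scores_as_int, reverse=True)

print(text_desc)  # ['91', '9', '87']
print(int_desc)   # [91, 87, 9]
```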
Something that might help you is to read through the DataStax doc on Compound Keys and Clustering.

Paging Resultsets in Cassandra with compound primary keys - Missing out on rows

So, my original problem was using the token() function to page through a large data set in Cassandra 1.2.9, as explained and answered here: Paging large resultsets in Cassandra with CQL3 with varchar keys
The accepted answer got the select working with tokens and chunk size, but another problem manifested itself.
My table looks like this in cqlsh:
key | column1 | value
---------------+-----------------------+-------
85.166.4.140 | county_finnmark | 4
85.166.4.140 | county_id_20020 | 4
85.166.4.140 | municipality_alta | 2
85.166.4.140 | municipality_id_20441 | 2
93.89.124.241 | county_hedmark | 24
93.89.124.241 | county_id_20005 | 24
The primary key is a composite of key and column1. In CLI, the same data looks like this:
get ip['85.166.4.140'];
=> (counter=county_finnmark, value=4)
=> (counter=county_id_20020, value=4)
=> (counter=municipality_alta, value=2)
=> (counter=municipality_id_20441, value=2)
Returned 4 results.
The problem
When using CQL with a limit of, e.g., 100, the returned results may stop in the middle of a record, like this:
key | column1 | value
---------------+-----------------------+-------
85.166.4.140 | county_finnmark | 4
85.166.4.140 | county_id_20020 | 4
leaving these two "rows" (columns) out:
85.166.4.140 | municipality_alta | 2
85.166.4.140 | municipality_id_20441 | 2
Now, when I use the token() function for the next page like this, these two rows are skipped:
select * from ip where token(key) > token('85.166.4.140') limit 10;
Result:
key | column1 | value
---------------+------------------------+-------
93.89.124.241 | county_hedmark | 24
93.89.124.241 | county_id_20005 | 24
95.169.53.204 | county_id_20006 | 2
95.169.53.204 | county_oppland | 2
So, no trace of the last two results from the previous IP address.
Question
How can I use token() for paging without skipping over cql rows? Something like:
select * from ip where token(key) > token(key:column1) limit 10;
Ok, so I used the info in this post to work out a solution:
http://www.datastax.com/dev/blog/cql3-table-support-in-hadoop-pig-and-hive
(section "CQL3 pagination").
First, I execute this cql:
select * from ip limit 5000;
From the last row in the resultset, I get the key (e.g. '85.166.4.140') and the value from column1 (e.g. 'county_id_20020').
Then I create a prepared statement evaluating to
select * from ip where token(key) = token('85.166.4.140') and column1 > 'county_id_20020' ALLOW FILTERING;
(I'm guessing it would work also without using the token() function, as the check is now for equal:)
select * from ip where key = '85.166.4.140' and column1 > 'county_id_20020' ALLOW FILTERING;
The resultset now contains the remaining X rows (columns) for this IP. The method then returns all the rows, and the next call to the method includes the last used key ('85.166.4.140'). With this key, I can execute the following select:
select * from ip where token(key) > token('85.166.4.140') limit 5000;
which gives me the next 5000 rows from (and including) the first IP after '85.166.4.140'.
Now, no columns are lost in the paging.
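For anyone who wants to check the logic without a cluster, here is a rough Python simulation of the three queries above, run against an in-memory list. Everything in it is an assumption made for illustration: `token` is an MD5-based stand-in for the real partitioner, and `page_all` and its `finish_partition` flag are hypothetical names. Setting the flag to False reproduces the original problem.

```python
import hashlib


def token(key):
    # Stand-in for the partitioner hash (assumption; not Cassandra's real token).
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def page_all(table, limit, finish_partition=True):
    """Page through `table`, whose rows are sorted by (token(key), column1)."""
    seen = []
    page = table[:limit]  # select * from ip limit N
    while page:
        last_key, last_col, _ = page[-1]
        if finish_partition:
            # select * from ip where key = K and column1 > C
            page = page + [r for r in table if r[0] == last_key and r[1] > last_col]
        seen.extend(page)
        # select * from ip where token(key) > token(K) limit N
        after = token(last_key)
        page = [r for r in table if token(r[0]) > after][:limit]
    return seen


rows = [
    ("85.166.4.140", "county_finnmark", 4),
    ("85.166.4.140", "county_id_20020", 4),
    ("85.166.4.140", "municipality_alta", 2),
    ("85.166.4.140", "municipality_id_20441", 2),
    ("93.89.124.241", "county_hedmark", 24),
    ("93.89.124.241", "county_id_20005", 24),
    ("95.169.53.204", "county_id_20006", 2),
    ("95.169.53.204", "county_oppland", 2),
]
table = sorted(rows, key=lambda r: (token(r[0]), r[1]))

complete = page_all(table, limit=2)
naive = page_all(table, limit=2, finish_partition=False)
print(len(complete), len(naive))  # 8 6
```

With a page size of 2, the naive version drops the last two columns of the four-column IP, while the version that finishes the partition before moving to the next token returns every row.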
UPDATE
Cassandra 2.0 introduced automatic paging, handled by the client.
More info here: http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
(note that setFetchSize is optional and not necessary for paging to work)

Cassandra Delete by Secondary Index or By Allowing Filtering

I’m trying to delete by a secondary index or column key in a table. I'm not concerned with performance as this will be an unusual query. Not sure if it’s possible? E.g.:
CREATE TABLE user_range (
id int,
name text,
end int,
start int,
PRIMARY KEY (id, name)
)
cqlsh> select * from dat.user_range where id=774516966;
id | name | end | start
-----------+-----------+-----+-------
774516966 | 0 - 499 | 499 | 0
774516966 | 500 - 999 | 999 | 500
I can:
cqlsh> select * from dat.user_range where name='1000 - 1999' allow filtering;
id | name | end | start
-------------+-------------+------+-------
-285617516 | 1000 - 1999 | 1999 | 1000
-175835205 | 1000 - 1999 | 1999 | 1000
-1314399347 | 1000 - 1999 | 1999 | 1000
-1618174196 | 1000 - 1999 | 1999 | 1000
Blah blah…
But I can’t delete:
cqlsh> delete from dat.user_range where name='1000 - 1999' allow filtering;
Bad Request: line 1:52 missing EOF at 'allow'
cqlsh> delete from dat.user_range where name='1000 - 1999';
Bad Request: Missing mandatory PRIMARY KEY part id
Even if I create an index:
cqlsh> create index on dat.user_range (start);
cqlsh> delete from dat.user_range where start=1000;
Bad Request: Non PRIMARY KEY start found in where clause
Is it possible to delete without first knowing the primary key?
No, deleting by using a secondary index is not supported: CASSANDRA-5527
Once you have your secondary index, you can select all matching rows through it. Those rows give you the primary keys, which you can then use to delete the rows.
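As a sketch of that select-then-delete approach, the Python below turns rows that a SELECT has already returned into one DELETE per primary key. The helper name `build_deletes` is hypothetical, and the rows are hard-coded from the question's output. The `%s` placeholders assume the statements will be run through something like the DataStax Python driver's `session.execute(statement, params)`; adjust for whichever client you use.

```python
DELETE_CQL = "DELETE FROM dat.user_range WHERE id = %s AND name = %s"


def build_deletes(selected_rows):
    """Turn rows returned by the SELECT into (statement, parameters) pairs,
    one per full primary key (id, name)."""
    return [(DELETE_CQL, (row["id"], row["name"])) for row in selected_rows]


# Rows as they might come back from:
#   select * from dat.user_range where name='1000 - 1999' allow filtering;
selected_rows = [
    {"id": -285617516, "name": "1000 - 1999", "end": 1999, "start": 1000},
    {"id": -175835205, "name": "1000 - 1999", "end": 1999, "start": 1000},
]

deletes = build_deletes(selected_rows)
for statement, params in deletes:
    print(statement, params)
```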
I came here looking for a solution to delete rows from cassandra column family.
I ended up doing an INSERT and set a TTL (time to live) so that I don't have to worry about deleting it.
Putting it out there, might help someone.

Is there a way to make clustering order by data type and not string in Cassandra?

I created a table in CQL3 in the cqlsh using the following CQL:
CREATE TABLE test (
locationid int,
pulseid int,
name text, PRIMARY KEY(locationid, pulseid)
) WITH CLUSTERING ORDER BY (locationid ASC, pulseid DESC);
Note that locationid is an integer.
However, after I inserted data, and ran a select, I noticed that locationid's ascending sort seems to be based upon string, and not integer.
cqlsh:citypulse> select * from test;
locationid | pulseid | name
------------+---------+------
0 | 3 | test
0 | 2 | test
0 | 1 | test
0 | 0 | test
10 | 3 | test
5 | 3 | test
Note the 0 10 5. Is there a way to make it sort via its actual data type?
Thanks,
Allison
In Cassandra, the first part of the primary key is the 'partition key'. That key is used to distribute data around the cluster. It does this in an effectively random fashion (by hashing the key) to achieve an even distribution. This means that you cannot order by the first part of your primary key.
What version of Cassandra are you on? In the most recent version of 1.2 (1.2.2), the create statement you have used as an example is invalid.

Resources