Is it possible to run this kind of query efficiently in Cassandra?
Say I have a table something:
CREATE TABLE something(
a INT,
b INT,
c INT,
d INT,
e INT,
PRIMARY KEY(a,b,c,d,e)
);
And I want to query this table in the following ways:
SELECT * FROM something WHERE a=? AND b=? AND e=?
or
SELECT * FROM something WHERE a=? AND c=? AND d=?
or
SELECT * FROM something WHERE a=? AND b=? AND d=?
and so on.
None of the above queries will work, because Cassandra requires a query to restrict the clustering columns in order.
I know that this kind of scenario normally calls for materialized views or for denormalizing the data into several tables. However, in this case I would need 4*3*2*1 = 24 tables, which is not a viable solution.
Secondary indexes require ALLOW FILTERING to be turned on for a multi-index query to work, which seems to be a bad idea. Besides, there may be some high-cardinality columns in the something table.
I would like to know if there is any workaround that allows such complicated queries to work.
How are you ending up with 24 tables? I don't follow.
If your query has equality conditions on 3 columns, isn't that 10 different queries, i.e. 5C3?
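The two counts being compared can be checked with a short Python sketch. It is only an illustration: the column list comes from the question, while the extra count of filters that always include a is an assumption based on the example queries, which all restrict a.

```python
from itertools import combinations
from math import comb, factorial

columns = ["a", "b", "c", "d", "e"]

# Every way to pick 3 of the 5 key columns for an equality filter (5C3).
three_column_filters = list(combinations(columns, 3))

# Assumption: if the partition key a is always part of the filter,
# only the remaining two columns vary.
filters_with_a = [f for f in three_column_filters if "a" in f]

# The question's figure: orderings of the 4 clustering columns b, c, d, e.
clustering_orderings = factorial(4)

print(len(three_column_filters), comb(5, 3))
print(len(filters_with_a))
print(clustering_orderings)
```

The 24 in the question counts orderings of the clustering columns, whereas the number of distinct three-column filters is much smaller.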
Maybe I only partially understood your requirement and you really do need 24 queries. In any case, here are my suggestions:
Find any columns with low cardinality and create a secondary index on them to satisfy at least 1 query.
Things to avoid:
Don't go with 1 base table and 23 materialized views. Keep this ratio down to 1 base table : 5 to 8 materialized views. It pays to denormalize from the application side.
You may use a uuid as the primary key in your base table so you can use it in the materialized views.
Overall, even if you have 24 queries, try to get down to 4 or 5 base tables and then create 5 or 6 materialized views on each of them to reach your intended number of 24 or whatever it is.
You can use Solr along with Cassandra to get such queries to work. If you are using DSE, it is much easier. In a Solr query you can directly write:
SELECT * FROM keyspace.something WHERE solr_query='a:? b:? e:?'
Refer to the link below, which shows all the possible combinations you can use with Solr:
https://docs.datastax.com/en/datastax_enterprise/5.0/datastax_enterprise/srch/queriesCql.html?hl=solr%2Cwhere
Writes are very efficient in C*. Reads by partition key are also performant.
Create 2 tables, an index table and a content table:
CREATE TABLE somethingIndex(
a_index text PRIMARY KEY,
a INT
);
CREATE TABLE something(
a INT PRIMARY KEY,
b INT,
c INT,
d INT,
e INT
);
During a write, INSERT every combination of (a,b,c,d,e) by concatenating their values.
With 5 elements taken 3 at a time, the maximum is 11 inserts: 10 INSERTs into somethingIndex + 1 INSERT into something.
This will be much more efficient than using Solr or another solution such as materialized views.
Look at Solr if you need full-text search. For exact-match search, the solution above is efficient.
To read data, first select the "a" value from somethingIndex and then read from the something table.
SELECT a FROM somethingIndex where a_index = ?; // (a+b+e) or (a+c+d) or (a+b+d);
SELECT * FROM something where a = ?;
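A minimal Python sketch of how the application could build the concatenated a_index values before writing. The helper name index_keys and the "col=value|col=value" key format are assumptions made for illustration; the answer only says to concatenate the values. Including the column names in the key is a choice made here so that, for example, (a,b,e) and (a,c,d) with equal values do not collide.

```python
from itertools import combinations

COLUMNS = ("a", "b", "c", "d", "e")

def index_keys(row, size=3):
    """Return one lookup key per combination of `size` columns in `row`."""
    keys = []
    for cols in combinations(COLUMNS, size):
        # Hypothetical key format: column names are part of the key so that
        # different column combinations never produce the same string.
        keys.append("|".join(f"{c}={row[c]}" for c in cols))
    return keys

row = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
keys = index_keys(row)

print(len(keys))  # one INSERT into somethingIndex per key
print(keys[0])
```

At read time the application would build the key the same way from the three known values and use it as the a_index lookup.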
Related
I have a question about querying a Cassandra collection.
I want to write a query that searches within a collection.
CREATE TABLE rd_db.test1 (
testcol3 frozen<set<text>> PRIMARY KEY,
testcol1 text,
testcol2 int
)
That is the table structure; the table contents are not shown here.
In this situation, I want to write a CQL query that accepts alternative values for the set column.
If it were SQL and testcol3 weren't a collection, I would write
select * from rd_db.test1 where testcol3 = 4 or testcol3 = 5
but it is CQL and a collection, so I tried
select * from test1 where testcol3 contains '4' OR testcol3 contains '5' ALLOW FILTERING ;
select * from test1 where testcol3 IN ('4','5') ALLOW FILTERING ;
but these two queries didn't work...
Please help...
This won't work for you, for multiple reasons:
there is no OR operation in CQL
you can only do a full match on the value of the partition key (testcol3)
although you can create secondary indexes on fields with a collection type, it's impossible to create an index on the values of the partition key
You need to change the data model, but you need to know in advance which queries you will execute. From a brief look at your data model, I would suggest unrolling the set field into multiple rows, with each individual value corresponding to its own partition.
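The unrolling suggested above can be sketched in Python. This is an illustration only: the sample rows are hypothetical, and the helpers unroll and lookup are names invented here to emulate what the application and Cassandra would each do.

```python
def unroll(rows):
    """Turn each (set, testcol1, testcol2) row into one row per set element."""
    return [
        (element, testcol1, testcol2)
        for testcol3, testcol1, testcol2 in rows
        for element in sorted(testcol3)
    ]

# Hypothetical sample data, invented for this sketch.
rows = [
    (frozenset({"1", "4"}), "x", 10),
    (frozenset({"5"}), "y", 20),
    (frozenset({"2", "3"}), "z", 30),
]

unrolled = unroll(rows)

def lookup(unrolled_rows, wanted):
    """Emulate one partition-key read per wanted value, merged client side."""
    return [r for value in wanted for r in unrolled_rows if r[0] == value]

result = lookup(unrolled, ["4", "5"])
print(result)
```

With the element as the partition key, the "4 or 5" search becomes two exact-match reads whose results are combined by the client, which avoids both OR and ALLOW FILTERING.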
I would also suggest taking the DS201 & DS220 courses on the DataStax Academy site for a better understanding of how Cassandra works and how to model data for it.
For example, if my primary key is a and clustering columns are b and c.
Can I only use the following in where condition?
select * from table where a = 1 and b = 2 and c = 3
Or are there any other queries that I can use?
I want to use
select * from table where a=1
and
select * from table where a = 1 and b = 2 and c = 3 and d = 4
Is that possible?
If not, then how can I model my data to make this possible?
Cassandra has lots of advantages, but it does not fit every need.
Cassandra is a good choice when you need to handle a large volume of writes. People like it because it is easily scalable, can handle huge datasets, and is highly fault tolerant.
You need to keep in mind that with Cassandra (if you really want to make the most of it) the basic rule is to model your data to fit your queries. Don't model around relations. Don't model around objects. Model around your queries. This way you can minimize partition reads.
And of course you can query on more than just the partition key and clustering columns. You can:
add a secondary index to some columns or
use the ALLOW FILTERING keyword
but of course, these are not as effective as having a well-modeled table.
For example, if my primary key is a and clustering columns are b and c.
So this translates into a definition of: PRIMARY KEY ((a),b,c). Based on that...
are there any other queries that I can use?
Yes. Some important points to understand about PRIMARY KEY columns in a query's WHERE clause:
They must be specified in order.
A key in the middle cannot be skipped.
Trailing keys can be omitted, as long as all the keys before them are specified.
select * from table where a=1
Yes, this query will work. That's because you're still querying by your partition key (a).
select * from table where a = 1 and b = 2 and c = 3 and d = 4
However, this will not work. That is because d is not (based on my understanding of your first statement) a part of your PRIMARY KEY definition.
If not, then how can I model my data to make this possible?
As Andrea mentioned, you should build your table according to the queries it needs to support. So if you need to query by a, b, c, and d, you'll need to make d a part of your PRIMARY KEY.
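The ordering rules above can be sketched as a small Python check. It is a deliberate simplification, not Cassandra's actual validation logic: it assumes a single-column partition key, equality conditions only, and no secondary indexes or ALLOW FILTERING. The function name is hypothetical.

```python
def is_valid_equality_filter(primary_key, filtered):
    """True if the filtered columns form a non-empty prefix of primary_key."""
    filtered = set(filtered)
    if not filtered or not filtered <= set(primary_key):
        # Nothing restricted, or a column outside the primary key is used.
        return False
    # The restricted columns must be exactly the first len(filtered) keys.
    prefix = primary_key[:len(filtered)]
    return set(prefix) == filtered

pk = ["a", "b", "c"]  # PRIMARY KEY ((a), b, c)

print(is_valid_equality_filter(pk, ["a"]))
print(is_valid_equality_filter(pk, ["a", "b", "c"]))
print(is_valid_equality_filter(pk, ["a", "c"]))
print(is_valid_equality_filter(pk, ["a", "b", "c", "d"]))
```

Under these assumptions, filtering on a alone or on a, b, c is accepted, while skipping b or adding the non-key column d is rejected, matching the two cases discussed above.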
I have a table as below
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),start,id)
);
I want to run this query
Select * from test where day=1 and start > 1475485412 and start < 1485785654
and action='accept' ALLOW FILTERING
Is this ALLOW FILTERING efficient?
I am expecting that cassandra will filter in this order
1. By Partitioning column(day)
2. By the range column(start) on the 1's result
3. By action column on 2's result.
So ALLOW FILTERING should not be a bad choice for this query.
When the WHERE clause has multiple filtering parameters and the non-indexed column is the last one, how does the filter work?
Please explain.
Is this ALLOW FILTERING efficient?
By "this" you mean in the context of your query and your model. However, the efficiency of an ALLOW FILTERING query depends mostly on the data it has to filter. Unless you show some real data, this question is hard to answer.
I am expecting that cassandra will filter in this order...
Yes, this is what will happen. However, the presence of an ALLOW FILTERING clause in a query usually indicates a poor table design, that is, you're not following the Cassandra modeling guidelines (specifically "one query <--> one table").
As a solution, I would suggest including the action field in the clustering key just before the start field, modifying your table definition:
CREATE TABLE test (
day int,
id varchar,
start int,
action varchar,
PRIMARY KEY((day),action,start,id)
);
You then would rewrite your query without any ALLOW FILTERING clause:
SELECT * FROM test WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654
The only minor issue is that if a record "switches" action values, you cannot update the action field alone (because it's now part of the clustering key), so you need to delete the row with the old action value and insert it again with the new value. But if you have Cassandra 3.0+, all this can be done with the help of the new Materialized View implementation. Have a look at the documentation for further information.
In general, ALLOW FILTERING is not efficient.
But in the end it depends on the size of the data you are fetching (for which Cassandra has to use ALLOW FILTERING) and the size of the data it is being fetched from.
In your case Cassandra does not need filtering up to:
By the range column(start) on the 1's result
as you mentioned. After that, it relies on filtering to find the data, which you are allowing in the query itself.
Now, keep the following in mind:
If your table contains, for example, 1 million rows and 95% of them have the requested value, the query will still be relatively efficient and you should use ALLOW FILTERING.
On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value, your query is extremely inefficient. Cassandra will load 999,998 rows for nothing. If the query is used often, it is probably better to add an index on the filtered column.
So check this first. If it works in your favour, use ALLOW FILTERING.
Otherwise, it would be wise to add a secondary index on 'action'.
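The two cases above come down to simple arithmetic, sketched here in Python. The function name filtering_waste is hypothetical and the row counts are the ones from the text; nothing here measures a real cluster.

```python
def filtering_waste(rows_scanned, rows_matching):
    """Return (rows read and discarded, fraction of scanned rows that match)."""
    discarded = rows_scanned - rows_matching
    useful_fraction = rows_matching / rows_scanned
    return discarded, useful_fraction

# 95% of 1 million rows match: most of the scan is useful work.
mostly_matching = filtering_waste(1_000_000, 950_000)

# Only 2 of 1 million rows match: almost the entire scan is wasted.
rarely_matching = filtering_waste(1_000_000, 2)

print(mostly_matching)
print(rarely_matching)
```

In the question's query, the rows scanned are those left after the day and start restrictions, so the relevant ratio is how many of them have action='accept'.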
PS : There is some minor edit.
This question is I hope not answered in the usual "secondary index v. clustering key" questions.
Here is a simple model I have:
CREATE TABLE ks.table1 (
md_name text,
timestamp int,
device text,
value int,
PRIMARY KEY (md_name, timestamp, device)
)
Basically I view my data as datasets named by md_name; each dataset is a kind of sparse 2D matrix (rows = timestamps, columns = devices) containing value.
As the problem and the queries can be pretty symmetric (i.e. is my "matrix" the best representation, or should I use the transposed "matrix"?), I couldn't easily decide which clustering key to put first. The way I did it makes a bit more sense: for each timestamp I have a set of data (values for each device present at that timestamp).
The usual query is then
select * from cycles where md_name = 'xyz';
It targets a single partition, that will be super fast, easy enough. If there's a large amount of data my users could do something like this instead:
select * from cycles where md_name = 'xyz' and timestamp < n;
However I'd like to be able to "transpose" the problem and do this:
select * from cycles where md_name = 'xyz' and device='uvw';
That means I have to create a secondary index on device.
But (and that's where the question starts) this index is a bit different from the usual indexes, as it is used for queries inside a single partition. Creating the index also allows the same query across multiple partitions:
select * from cycles where device='uvw'
Which is not necessary in my case.
Can I improve my model to support such queries without too much duplication?
Is there something like a "per-partition index"?
The index would allow you to do queries like this:
select * from cycles where md_name='xyz' and device='uvw'
But that would return all timestamps for that device in the xyz partition.
So it sounds like maybe you want two views of the data: one based on name and time range, and one based on name, device, and time range.
If that's what you're asking, then you probably need two tables. If you're using C* 3.0, you could use the materialized views feature to create the second view. If you're on an earlier version, you'd have to create the two tables and write to each of them from your application.
I have a Cassandra table that is created like:
CREATE TABLE table(
num int,
part_key int,
val1 int,
val2 float,
val3 text,
...,
PRIMARY KEY((part_key), num)
);
part_key is 1 for every record, because I want to execute range queries and only have one server (I know that's not a good use case). num is the record number from 1 to 1.000.000. I can already run queries like
SELECT num, val43 FROM table WHERE part_key=1 and num<5000;
Is it possible to do some more filtering in Cassandra, like:
... AND val45>463;
I think it's not possible like that, but can somebody explain why?
Right now I do this filtering in my code, but are there other possibilities?
I hope I did not miss a post that already explains this.
Thank you for your help!
Cassandra range queries are only possible on the last clustering column specified by the query. So, if your pk is (a,b,c,d), you can do
... where a=2 AND b=4 AND c>5
... where a=2 AND b>4
but not
... where a=2 AND c>5
This is because data is stored in partitions, indexed by the partition key (the first key of the pk), and then sorted by each successive clustering key.
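A rough Python model of why the sort order matters. It treats one partition as a sorted list of (b, c, d) tuples, which is only an analogy for the on-disk layout; the data and the helper name are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# One partition (a=2): rows kept sorted by the clustering columns (b, c, d).
partition = sorted(
    (b, c, d) for b in range(1, 7) for c in range(1, 9) for d in range(1, 3)
)

def slice_b_eq_c_gt(rows, b_value, c_min):
    """Rows with b == b_value and c > c_min: one contiguous slice."""
    start = bisect_right(rows, (b_value, c_min, float("inf")))
    end = bisect_left(rows, (b_value + 1,))
    return rows[start:end]

# a=2 AND b=4 AND c>5 maps to a single contiguous range of the sorted rows.
allowed = slice_b_eq_c_gt(partition, 4, 5)

# a=2 AND c>5 (b skipped): the matching rows are scattered across the list.
positions = [i for i, (b, c, d) in enumerate(partition) if c > 5]
contiguous = positions == list(range(positions[0], positions[-1] + 1))

print(len(allowed), contiguous)
```

The allowed query can be answered by seeking to one position and reading forward, while the disallowed one would have to scan the whole partition and discard the rows in between.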
If you have exact values, you can add a secondary index on val4 and then do
... and val4=34
but that's about it. And even then, you want to hit a partition before applying the index. Otherwise you'll get a cluster-wide query that will likely time out.
The querying limitations exist because of the way Cassandra stores data for fast insert and retrieval. All data in a partition is held together, so filtering inside the partition on the client side is usually not a problem, unless you have very large wide rows (in which case, perhaps the schema should be reviewed).
Hope that helps.