Cassandra: secondary index on part of the composite key?

I am using a composite primary key consisting of 2 strings Name1, Name2, and a timestamp (e.g. 'Joe:Smith:123456'). I want to query a range of timestamps given an equality condition for either Name1 or Name2.
For example, in SQL:
SELECT * FROM testcf WHERE (timestamp > 111111 AND timestamp < 222222 AND Name2 = 'Brown');
and
SELECT * FROM testcf WHERE (timestamp > 111111 AND timestamp < 222222 AND Name1 = 'Charlie');
From my understanding, the first part of the composite key is the partition key, so the second query is possible, but the first query would require some kind of index on Name2.
Is it possible to create a separate index on a component of the composite key? Or am I misunderstanding something here?

You will need to manually create and maintain an index of names if you want to keep your schema and support the first query. Given this requirement, I question your choice of data model: your model should be designed with your read pattern in mind. I presume you are also storing some column values that you want to query by timestamp. If so, perhaps the following model would serve you better:
"[current_day]:Joe:Smith" {
123456:Field1 : value
123456:Field2 : value
123450:Field1 : value
123450:Field2 : value
}
With this model you can use the current day (or some known day) as a sentinel value, then filter on first and last names. You can also get a range of columns by timestamp using the composite column names.
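In CQL 3 terms, a sketch of that model might look like the following (the table and field names are assumptions for illustration, not from the original post):

-- Hedged CQL 3 rendering of the wide-row model above.
CREATE TABLE events_by_day (
    day    text,      -- sentinel value, e.g. the current day
    name1  text,
    name2  text,
    ts     bigint,    -- timestamp component of the composite column name
    field1 text,
    field2 text,
    PRIMARY KEY ((day, name1, name2), ts)
);

-- Range of timestamps for a known day and name pair:
SELECT * FROM events_by_day
WHERE day = '2012-12-05' AND name1 = 'Joe' AND name2 = 'Smith'
  AND ts > 123450 AND ts <= 123456;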

Related

Cassandra asking to allow filtering even after mentioning all partition key columns in the query?

I have been trying to model data in Cassandra and to filter it by date, as given by the answer here on SO, where the second answer does not use ALLOW FILTERING.
This is my current schema,
CREATE TABLE banking.BankData (
    acctID text,
    email text,
    transactionDate date,
    transactionAmount double,
    balance double,
    currentTime timestamp,
    PRIMARY KEY ((acctID, transactionDate), currentTime)
) WITH CLUSTERING ORDER BY (currentTime DESC);
Now I have inserted data with:
INSERT INTO banking.BankData (acctID, email, transactionDate, transactionAmount, balance, currentTime)
VALUES ('11', 'alpitanand20@gmail.com', '2013-04-03', 10010, 10010, toTimestamp(now()));
Now when I try to query, like
SELECT * FROM banking.BankData WHERE acctID = '11' AND transactionDate > '2012-04-03';
It tells me to use ALLOW FILTERING, although in the link mentioned above that was not the case.
The final requirement is to get data by year, month, week and so on; that is why I partitioned it by day, but the date range query is not working.
Any suggestion on remodeling, or am I doing something wrong?
Thanks
Cassandra supports only equality predicates on partition key columns, so you can use only the = operator on them.
Range predicates (>, <, >=, <=) are supported only on clustering columns, and only on the last clustering column referenced in the condition.
For example, if you have the following primary key: (pk, c1, c2, c3), you can have a range predicate as follows:
where pk = xxxx and c1 > yyyy
where pk = xxxx and c1 = yyyy and c2 > zzzz
where pk = xxxx and c1 = yyyy and c2 = zzzz and c3 > wwww
but you can't have:
where pk = xxxx and c2 > zzzz
where pk = xxxx and c3 > zzzz
because you need to restrict the previous clustering columns before using a range operation.
If you want to perform a range query on this data, you need to declare corresponding column as clustering column, like this:
PRIMARY KEY(acctID, transactionDate, currentTime )
In this case you can perform your query. But because you have a time component, you can simply do:
PRIMARY KEY(acctID, currentTime )
and do the query like this:
SELECT * FROM banking.BankData WHERE acctID = '11'
AND currentTime > '2012-04-03T00:00:00Z';
But you need to take 2 things into consideration:
your primary key should be unique; you may need to add another clustering column, such as a transaction ID (for example, of type uuid), so that even if two transactions happen in the same millisecond they won't overwrite each other;
if you have a lot of transactions per account, you may need to add another column to the partition key, for example year, or year/month, so you don't end up with very large partitions.
P.S. In the linked answer, the non-equality operation is possible because ts is a clustering column.
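A rough sketch combining both points (the year bucket and the txn_id column are assumed names, not part of the original schema):

CREATE TABLE banking.BankDataByYear (
    acctID text,
    year int,                    -- bucket that keeps partitions bounded
    currentTime timestamp,
    txn_id uuid,                 -- keeps same-millisecond transactions distinct
    email text,
    transactionAmount double,
    balance double,
    PRIMARY KEY ((acctID, year), currentTime, txn_id)
) WITH CLUSTERING ORDER BY (currentTime DESC, txn_id DESC);

-- Range query within one account-year partition:
SELECT * FROM banking.BankDataByYear
WHERE acctID = '11' AND year = 2013
  AND currentTime > '2013-01-01T00:00:00Z';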

Cassandra CQL: Filter the rows between a range of values

The structure of my column family is something like
CREATE TABLE product (
    id UUID PRIMARY KEY,
    product_name text,
    product_code text,
    status text,          // in stock, out of stock
    mfg_date timestamp,
    exp_date timestamp
);
Secondary Index is created on status, mfg_date, product_code and exp_date fields.
I want to select the list of products whose status is 'IS' (in stock) and whose manufacture date is between timestamps xxxx and xxxx.
So I tried the following query.
SELECT * FROM product where status='IS' and mfg_date>= xxxxxxxxx and mfg_date<= xxxxxxxxxx LIMIT 50 ALLOW FILTERING;
It throws an error: No indexed columns present in by-columns clause with "equals" operator.
Is there anything I need to change in the structure? Please help me out. Thanks in advance.
Cassandra does not support >= or <= here, so you have to adjust the values and use only > (greater than) and < (less than) when executing the query.
You should have at least one "equals" operator on one of the indexed or primary key columns in your WHERE clause, i.e. "mfg_date = xxxxx".
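Putting both answers together, a version of the query that should pass validation might look like this (the timestamp values are placeholders):

SELECT * FROM product
WHERE status = 'IS'                  -- the required "equals" term on an indexed column
  AND mfg_date > 1357000000000      -- strict bounds instead of >= / <=
  AND mfg_date < 1359000000000
LIMIT 50 ALLOW FILTERING;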

Cassandra CQL - clustering order with multiple clustering columns

I have a column family with primary key definition like this:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
which will be queried using queries such as:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id = 0 AND date > 'some_date' ;
However, I'd like to keep my column family ordered by date only, so that SELECT date FROM myCF; would return the most recently inserted date first.
Due to the order of clustering columns, what I get is an order per user_id then per date.
If I change the primary key definition to:
PRIMARY KEY ((website_id, item_id), date, user_id)
I can no longer run the same query, as date must be restricted if user_id is.
I thought there might be some way to say:
...
PRIMARY KEY ((website_id, shop_id), store_id, date)
) WITH CLUSTERING ORDER BY (store_id RANDOMPLEASE, date DESC) ;
But it doesn't seem to exist. Worse, maybe this is completely stupid and I don't get why.
Is there any way of achieving this? Am I missing something?
Many thanks!
Your query example restricts user_id, so that should work with the second table format. But if you are actually trying to run queries like
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND date > 'some_date'
Then you need an additional table created to handle those queries; it would order only on date and not on user_id:
Create Table LookupByDate ... PRIMARY KEY ((website_id, item_id), date)
In addition to your primary query, if all you are trying to get is "the most recent inserted date", you may not need an additional table: you can use a static column to store the last update time per partition. See CASSANDRA-6561.
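A minimal sketch of that static-column approach, assuming a column name like last_update_date (not part of the original schema):

-- A static column holds one value per partition.
ALTER TABLE myCF ADD last_update_date timestamp static;

-- Set it alongside each insert:
UPDATE myCF SET last_update_date = '2015-01-01T00:00:00Z'
WHERE website_id = 30 AND item_id = 10;

-- Read the latest date back without scanning the clustered rows:
SELECT last_update_date FROM myCF
WHERE website_id = 30 AND item_id = 10;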
It probably won't help your particular case (since I imagine your list of all users is unmanageably large), but if the condition on the first clustering column matches one of a relatively small set of values then you can use IN.
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id IN ? AND date > 'some_date'
Don't use IN on the partition key, because that creates an inefficient query that hits multiple nodes, putting stress on the coordinator node. Instead, execute multiple asynchronous queries in parallel. IN on a clustering column, however, is absolutely fine.

Timestamp / date as key for cassandra column family / hector

I have to create and query a column family with a composite key of [timestamp, long]. Also, while querying I want to fire a range query on the timestamp (like timestamp between xxx and yyy). Is this possible?
Currently I am doing something really funny (which I know is not correct): I create keys from the timestamp string for the given range and concatenate each with the long, like:
1254345345435-1234
3423432423432-1234
1231231231231-9999
and pass the set of keys to the Hector API (so if I have a date range of 1 month and I want per-minute data, I create 30 * 24 * 60 * [number of secondary keys - long] keys).
I can solve the concatenation issue with a composite key, but the query part is what I am trying to understand.
As far as I understand, since we are using RandomPartitioner we cannot really query by key range, as keys are ordered by their MD5 checksum. What's the ideal design for this kind of use case?
My schema and requirements are as follows (actual cqlsh):
CREATE TABLE report (
    ts timestamp,
    user_id bigint,
    svc1 bigint,
    svc2 bigint,
    svc3 bigint,
    PRIMARY KEY (ts, user_id)
);

-- desired query:
SELECT * FROM report
WHERE ts > 123445345435 AND ts < 32423423424
  AND user_id IN (123, 567, 987);
You cannot do range queries on the first component of a composite key. Instead, you should write a sentinel value such as a daystamp (the unix epoch at midnight on the current day) as the key, then write a composite column as timestamp:long. This way you can provide the keys that comprise your range, and slice on the timestamp component of the composite column.
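In CQL terms, that suggestion might be sketched as follows (the table name and layout are assumptions, not from the original post):

CREATE TABLE report_by_day (
    day bigint,        -- unix epoch at midnight: the sentinel key
    ts timestamp,      -- timestamp component of the composite column
    user_id bigint,
    svc1 bigint,
    svc2 bigint,
    svc3 bigint,
    PRIMARY KEY ((day), ts, user_id)
);

-- One query per day in the range, slicing on the timestamp component:
SELECT * FROM report_by_day
WHERE day = 1354665600
  AND ts >= '2012-12-05T08:00:00Z' AND ts < '2012-12-05T09:00:00Z';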
Denormalize! You must model your schema in a manner that will enable the types of queries you wish to perform. We create a reverse (aka inverted, inverse) index for such scenarios.
CREATE TABLE report (
    KEY uuid PRIMARY KEY,
    svc1 bigint,
    svc2 bigint,
    svc3 bigint
);

CREATE TABLE ReportsByTime (
    KEY ascii PRIMARY KEY
) WITH default_validation=uuid AND comparator=uuid;

CREATE TABLE ReportsByUser (
    KEY bigint PRIMARY KEY
) WITH default_validation=uuid AND comparator=uuid;
See here for a nice explanation. What you are doing now is generating your own ascii key in the times table to enable the range slice query you want; it doesn't have to be ascii, though, just something you can use to programmatically generate your own slice keys.
You can use this approach to facilitate all of your queries; it likely won't suit your application directly, but the idea is the same. You can squeeze more out of this by adding meaningful values to the column keys of each table above.
cqlsh:tester> select * from report;
KEY | svc1 | svc2 | svc3
--------------------------------------+------+------+------
1381b530-1dd2-11b2-0000-242d50cf1fb5 | 332 | 333 | 334
13818e20-1dd2-11b2-0000-242d50cf1fb5 | 222 | 223 | 224
13816710-1dd2-11b2-0000-242d50cf1fb5 | 112 | 113 | 114
cqlsh:tester> select * from times;
KEY,1212051037 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5 | 1381b530-1dd2-11b2-0000-242d50cf1fb5,1381b530-1dd2-11b2-0000-242d50cf1fb5
KEY,1212051035 | 13816710-1dd2-11b2-0000-242d50cf1fb5,13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5
KEY,1212051036 | 13818e20-1dd2-11b2-0000-242d50cf1fb5,13818e20-1dd2-11b2-0000-242d50cf1fb5
cqlsh:tester> select * from users;
KEY | 13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5
-------------+--------------------------------------+--------------------------------------
23123123231 | 13816710-1dd2-11b2-0000-242d50cf1fb5 | 13818e20-1dd2-11b2-0000-242d50cf1fb5
Why don't you use wide rows, where the key is the timestamp and the column name is the long value? Then you can pass multiple keys (timestamps) to getKeySlice and select multiple columns in withColumnSlice by their name (which is the id).
As I don't know what your column names and values are, I feel this can help you. Can you provide more details of your column family definition?

Cassandra BETWEEN & ORDER BY operations

I want to perform SQL operations such as BETWEEN and ORDER BY with ASC/DESC order on Cassandra 0.7.8.
As far as I know, Cassandra 0.7.8 does not have direct support for these operations. Kindly let me know whether there is a way to accomplish them by tweaking secondary indexes?
Below is my Data model design.
Emp(KS){
    User(CF):{
        bsanderson(RowKey): { eno, name, dept, dob, email }
        prothfuss(RowKey): { eno, name, dept, dob, email }
    }
}
Queries:
- Select * from emp where dept='IT' ORDER BY dob ASC.
- Select * from emp where eno BETWEEN ? AND ? ORDER BY dob ASC.
Thanks in advance.
Regards,
Thamizhananl
Select * from emp where dept='IT' ORDER BY dob ASC.
You can select rows where the 'dept' column has a certain value, by using the built-in secondary indexes. However, the rows will be returned in the order determined by the partitioner (RandomPartitioner or OrderPreservingPartitioner). To order by arbitrary values such as DOB, you would need to sort at the client.
Or, you could support this query directly by having a row for each dept, and a column for each employee, keyed (and therefore sorted) by DOB. But be careful of shared birthdays! And you'd still need subsequent queries to retrieve other data (the results of your SELECT *) for the employees selected, unless you denormalise so that the desired data is stored in the index too.
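In later CQL terms (Cassandra 0.7.8 itself predates CQL 3), that per-dept row might be sketched like this; the names are hypothetical, and eno in the clustering key breaks ties on shared birthdays:

CREATE TABLE emp_by_dept (
    dept text,
    dob timestamp,
    eno bigint,        -- tiebreaker for shared birthdays
    name text,
    email text,        -- denormalised fields avoid follow-up queries
    PRIMARY KEY ((dept), dob, eno)
);

SELECT * FROM emp_by_dept WHERE dept = 'IT';   -- rows come back sorted by dob ASC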
Select * from emp where eno BETWEEN ? AND ? ORDER BY dob ASC.
Secondary index querying in Cassandra requires at least one equality term, so I think you can do dept='IT' AND eno >= X AND eno <= Y, but not just a BETWEEN-style query.
You could do this by creating your own index row, with a column for each employee, keyed on the employee number, with an appropriate comparator so all the columns are automatically sorted in employee-number order. You could then do a range query on that row to get a list of matching employees - but you would need further queries to retrieve other data for each employee (dob etc), unless you denormalise so that the desired data is stored in the index too. You would still need to do the dob ordering at the client.
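Similarly, the hand-built employee-number index might be sketched like this (the single sentinel bucket is an assumption and would need sharding for a large company):

CREATE TABLE emp_by_eno (
    bucket int,        -- sentinel partition; shard it if the row grows large
    eno bigint,
    dob timestamp,
    name text,
    PRIMARY KEY ((bucket), eno)
);

SELECT * FROM emp_by_eno
WHERE bucket = 0 AND eno >= 1000 AND eno <= 2000;   -- the BETWEEN-style slice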
As far as I know, columns are sorted by the comparator you specify when you create the column family, and you can use the clustering key to get the sort order you want,
while rows in a column family are ordered by the partitioner.
I suggest you read Cassandra: The Definitive Guide, Chapter 6.
