Cassandra Range Query - cassandra

I have a set of products (product_Id, price).
The price of every product keeps changing and hence needs to be updated very frequently.
I want to perform range query on prices:
select * from products where price > 10 and price < 100;
Please note: I just want to get the products in a price range. The exact query does not matter.
What is the best possible way to model this scenario? I'm using cassandra 2.1.9.

If price is a clustering column, you can only run range queries on it together with an equality restriction on the partition key. E.g.
Your table:
products (productId text, price float, PRIMARY KEY(productId, price))
Your range query:
SELECT * FROM products
WHERE productId = 'ysdf834234' AND price < 1000 AND price > 30;
But I think this query is of little use. If you need price ranges without restricting the partition key, you need a new table. I also think a Cassandra table with only 2 columns is poor database design; for your use case a pure key-value store (like Redis) is a better option. If you can add columns such as productType, productVariation, productColor, productBrand ..., then Cassandra is a good option for you, and you can create tables like:
productsByType_price PRIMARY KEY(productType, productPrice, productId)
productsByType_color PRIMARY KEY(productType, productColor, productId)
productsByType_brand PRIMARY KEY(productType, productBrand, productId)
etc.
One tip: read a bit more about how Cassandra manages data internally. This really helps with data modelling.
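To make the per-type price table concrete, here is a minimal Python sketch that imitates how rows inside one partition stay sorted by the clustering column, which is what makes a price range cheap once the partition key is fixed. It is only an in-memory model, not driver code; the `PriceTable` class and its method names are invented for illustration.

```python
import bisect
from collections import defaultdict


class PriceTable:
    """Toy model of productsByType_price:
    PRIMARY KEY(productType, productPrice, productId)."""

    def __init__(self):
        # partition key -> list of (productPrice, productId), kept sorted
        self._partitions = defaultdict(list)

    def insert(self, product_type, price, product_id):
        bisect.insort(self._partitions[product_type], (price, product_id))

    def update_price(self, product_type, old_price, new_price, product_id):
        # The price is part of the key, so a price change is modelled as
        # removing the old row and inserting a new one.
        self._partitions[product_type].remove((old_price, product_id))
        self.insert(product_type, new_price, product_id)

    def price_range(self, product_type, low, high):
        """Rows with low < price < high inside a single partition."""
        rows = self._partitions.get(product_type, [])
        prices = [price for price, _ in rows]
        start = bisect.bisect_right(prices, low)   # first price > low
        end = bisect.bisect_left(prices, high)     # first price >= high
        return rows[start:end]
```

Note the cost that comes with frequently changing prices in this layout: because the price is part of the primary key, every price change is a delete plus an insert rather than an in-place update.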

Related

Why am I getting this error when I run the query?

When attempting to perform this query:
select race_name from sport_app.month_category_runner where race_type = 'URBAN RACE 10K' and club = 'CORNELLA ATLETIC';
I get the following error:
Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
It is an exercise, so I am not allowed to use ALLOW FILTERING.
So I have created two indexes in this way:
create index raceTypeIndex ON sport_app.month_category_runner(race_type);
create index clubIndex ON sport_app.month_category_runner(club);
But I keep getting the same error, am I missing something, or is there an alternative?
Table Structure:
CREATE TABLE month_category_runner (month text,
category text,
runner_id text,
club text,
race_name text,
race_type text,
race_date timestamp,
total_runners int,
net_time time,
PRIMARY KEY (month, category, runner_id, race_name, net_time));
Note that if you add ALLOW FILTERING, the query will run on all the nodes of the Cassandra cluster and can have a large impact on all of them.
The recommendation is to add the partition key as a condition in your query, so that the query is executed only on the nodes that hold that partition.
Example:
select race_name from month_category_runner where month = 'may' and club = 'CORNELLA ATLETIC';
select race_name from month_category_runner where month = 'may' and race_type = 'URBAN RACE 10K';
select race_name from month_category_runner where month = 'may' and race_type = 'URBAN RACE 10K' and club = 'CORNELLA ATLETIC' ALLOW FILTERING;
Your primary key is composed of (month, category, runner_id, race_name, net_time) and the column month is the partition key, so this column must be in your query filter, as I showed in the example.
To query on two columns that are not in the primary key, even though indexes exist on them, you need ALLOW FILTERING, which can have a performance impact.
The other option is to create a new table whose primary key contains these columns.
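The "new table" option can be pictured with a small Python sketch: the same rows are regrouped under a key built from the two columns you filter on, so the lookup becomes a direct key access instead of a scan. This is an in-memory analogy only; the sample rows and the `build_lookup` helper are made up for illustration.

```python
from collections import defaultdict

# Sample rows shaped like month_category_runner (values are invented).
rows = [
    {"month": "may", "runner_id": "r1", "club": "CORNELLA ATLETIC",
     "race_name": "Race A", "race_type": "URBAN RACE 10K"},
    {"month": "june", "runner_id": "r2", "club": "CORNELLA ATLETIC",
     "race_name": "Race B", "race_type": "URBAN RACE 10K"},
    {"month": "may", "runner_id": "r3", "club": "OTHER CLUB",
     "race_name": "Race A", "race_type": "URBAN RACE 10K"},
]


def build_lookup(source_rows):
    """Group race names under (race_type, club), like a table whose
    partition key is (race_type, club)."""
    lookup = defaultdict(list)
    for row in source_rows:
        lookup[(row["race_type"], row["club"])].append(row["race_name"])
    return lookup


races_by_type_and_club = build_lookup(rows)
result = races_by_type_and_club[("URBAN RACE 10K", "CORNELLA ATLETIC")]
```

The trade-off is the usual one for query tables: every write to the original table also has to be written to the second one.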

One to many mapping in Cassandra

I am new to Cassandra and would like to model a one-to-many mapping between a User and their vehicles. One user may have multiple vehicles. My User table will contain user details like name, surname, etc., and the Vehicle table will have vehicle details.
My select query will fetch all Vehicle details for particular User.
How should I design this in Cassandra?
You can easily model this in a single table:
CREATE TABLE userVehicles (
userid text,
vehicleid text,
name text static,
surname text static,
vehicleMake text,
vehicleModel text,
vehicleYear text,
PRIMARY KEY (userid,vehicleid)
);
This way you can query vehicles for a single user in one shot, and your user data can be static so that it is stored at the partition key level. As long as the cardinality of user to vehicle isn't too big (as-in, like a user has 1000 vehicles) this should work just fine.
The case I have described above is very simple. But what if my User has a lot of details, around 20 to 30 fields, and the same for Vehicle? Would you still suggest having a single table and copying the User data for every vehicle?
It depends. Would your use case require returning all of them? If so, then "yes" I would still recommend this approach. The way to get the best query performance out of Cassandra, is to model your tables to fit your queries. Cassandra works best when it can read a single row by a specific key, or a range of rows (stored sequentially). You want to avoid performing multiple queries or writing queries that force Cassandra to perform random reads.
What are the consequences of having 2 different tables, User and Vehicle, where the Vehicle table has User_Id and Vehicle_Id as its primary key?
In a distributed system network time is the enemy. By having two tables, you are now making two queries...assuming a 1 to 1 ratio of users to vehicles. But if your user has 8 vehicles, you now need 9 queries to achieve your result. With the design above you can build your result set in 1 query (minimizing network time). Also with userid as a partition key, that query is guaranteed to be served by one node, as opposed to additional queries for vehicle data which will most likely require contacting multiple nodes.
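As a rough picture of the single-table design, the following Python sketch models one userVehicles partition: the static columns are stored once per user, yet every returned row carries them. It is an in-memory analogy, not driver code, and the `UserVehiclesPartition` class is a hypothetical name.

```python
class UserVehiclesPartition:
    """Toy model of one partition of userVehicles (one userid)."""

    def __init__(self, userid, name, surname):
        self.userid = userid
        # Static columns: stored once for the whole partition.
        self.static = {"name": name, "surname": surname}
        # Clustering key (vehicleid) -> regular columns.
        self.vehicles = {}

    def add_vehicle(self, vehicleid, make, model, year):
        self.vehicles[vehicleid] = {
            "vehicleMake": make,
            "vehicleModel": model,
            "vehicleYear": year,
        }

    def select_all(self):
        """One read returns every vehicle row, each with the static values."""
        return [
            {"userid": self.userid, "vehicleid": vid, **self.static, **cols}
            for vid, cols in sorted(self.vehicles.items())
        ]
```

Changing `static["surname"]` once changes it for every row the next read returns, which is why the user details are not physically copied per vehicle in this design.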
This seems as simple as having two tables, one holding all of your vehicles data and another one for satisfying your query:
CREATE TABLE vehicles (
vehicle_id bigint,
vehicle_type int,
vehicle_name text,
...
PRIMARY KEY (vehicle_id)
)
CREATE TABLE vehicles_to_users (
user_id bigint,
vehicle_id bigint,
vehicle_type int,
vehicle_name text,
...
PRIMARY KEY (user_id, vehicle_type)
)
Then you would query by:
SELECT * FROM vehicles_to_users WHERE user_id = 9;
or something like this to get all vehicles of a specific type belonging to a particular user:
SELECT * FROM vehicles_to_users WHERE user_id = 9 AND vehicle_type = 1;
This is a solution with denormalized data, and you should always consider that approach instead of having something like:
CREATE TABLE vehicles (
vehicle_id bigint,
vehicle_type int,
vehicle_name text,
...
PRIMARY KEY (vehicle_id)
)
CREATE TABLE vehicles_to_users (
user_id bigint,
vehicle_id bigint,
PRIMARY KEY (user_id, vehicle_id)
)
because it belongs to the relational databases world and you'd have to run N+1 queries to satisfy your requirements: one to get all the ids belonging to a particular user, and then N queries to get all the information for each vehicle:
SELECT * FROM vehicles_to_users WHERE user_id = 9;
SELECT * FROM vehicles WHERE vehicle_id = 115;
SELECT * FROM vehicles WHERE vehicle_id = 116;
SELECT * FROM vehicles WHERE vehicle_id = ...;
And don't be tempted to use the IN clause like this:
SELECT * FROM vehicles WHERE vehicle_id IN (115,116,....);
because it would perform even worse due to the extra work that the coordinator node has to do.
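The difference in round trips can be counted with a small Python sketch. It is a toy model under the assumption that every read by key costs one query; `CountingTable`, `fetch_normalized`, and `fetch_denormalized` are hypothetical names used only here.

```python
class CountingTable:
    """Toy table that counts how many reads by key it serves."""

    def __init__(self, rows_by_key):
        self.rows_by_key = rows_by_key
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.rows_by_key.get(key, [])


def fetch_normalized(vehicles_to_users, vehicles, user_id):
    """Relational style: 1 query for the ids, then 1 query per vehicle."""
    result = []
    for vehicle_id in vehicles_to_users.read(user_id):
        result.extend(vehicles.read(vehicle_id))
    return result


def fetch_denormalized(vehicles_by_user, user_id):
    """Denormalized style: vehicle details live under the user key."""
    return vehicles_by_user.read(user_id)
```

For a user with 8 vehicles, the first function performs 9 reads spread over two tables, while the second performs 1.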

Cassandra design approach for my sample use case

I have been learning Cassandra for the past few days and tried to create a data model for the following use case:
"Each Zipcode in US has a list of stores sorted based on a defined rank"
"Each store/warehouse has millions of SKUs and the inventory is tracked"
"If I search using a zipcode and SKU, it should return the best possible 100 stores
with inventory, based on the rank"
Assume store count is 1000+ and sku count is in millions
Design tried
One table with
ZipCode
Rank
StoreID
primary key (ZipCode, Rank)
Another table with
Sku
Store
Inventory
Primary Key (Sku, Store)
Now, if I want to search the top 100 stores for each ZipCode, SKU combination, I have to search in table 1 for the top 100 stores and then pull the inventory of each store from the second table.
Since the SKU count is in millions and the store count is 1000+, I'm not sure if we can store all this in one table, with zipcode_sku as the row key and the stores and inventory stored as a wide row sorted by rank.
Am I thinking right? What could be other possible data models for this use case?
UPDATE: Data Loader Code (as mentioned in below comments)
println "Loading data started.."
(1..1000000).each { // SKUs
    sku = it.toString()
    (1..42000).each { // Zip Codes
        zipcode = it.toString().padLeft(5,"0")
        (1..1500).each { // Stores; `it` is also used as the store rank below
            store = it.toString()
            int inventory = Math.abs(new Random().nextInt() % 10000) + 1
            session.execute("INSERT INTO ritz.rankedStoreByZipcodeAndSku(sku, zipcode, store, store_rank, inventory) " +
                "VALUES('$sku','$zipcode','$store',$it,$inventory);")
        }
    }
}
println "Data Loaded"
Cassandra is a wide-column store, so you can have wide rows, and you usually want a table that represents each kind of query you want to make. In this case
CREATE TABLE storeByZipcodeAndSku (
sku text,
zipcode int,
store text,
store_rank int,
inventory int,
PRIMARY KEY ((sku, zipcode), store)
);
This way the row key is sku + zipcode, so it's a very fast lookup, and you can store up to 2 billion stores in it. When you update your inventory, also update this table. To get the top 100 you just pull down all of them and sort (1000s is not many), but if this operation is super common and you need it faster you can instead use
CREATE TABLE rankedStoreByZipcodeAndSku (
...
PRIMARY KEY ((sku, zipcode), store_rank)
) WITH CLUSTERING ORDER BY (store_rank ASC);
to have it sorted for you automatically and you just grab the top 100. Then when you update it you will want to use the lightweight transactions to move things around atomically.
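The two read paths described above can be sketched in Python. This is a minimal client-side illustration that assumes each row is a dict with `store_rank` and `inventory` keys and that "with inventory" means `inventory > 0`; both function names are hypothetical.

```python
import heapq


def top_stores_client_sort(rows, limit=100):
    """First table: rows arrive ordered by store, so rank them on the client."""
    in_stock = (row for row in rows if row["inventory"] > 0)
    return heapq.nsmallest(limit, in_stock, key=lambda row: row["store_rank"])


def top_stores_presorted(rows_sorted_by_rank, limit=100):
    """Second table: rows already arrive in rank order, so stop early."""
    result = []
    for row in rows_sorted_by_rank:
        if row["inventory"] > 0:
            result.append(row)
            if len(result) == limit:
                break
    return result
```

With the pre-sorted layout the client can stop reading as soon as it has enough in-stock stores, whereas the first layout needs the whole partition before it can rank anything.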
It sounds like you want to get a list of StoreIDs from the first table based on ZipCode, and a list of StoreIDs from the second table based on Sku, and then do a join. Since Cassandra is a simple key-value store, it doesn't do joins. So you would have to either write code in your client to do the two queries and manually do the join, or connect Cassandra to Spark, which has a join function.
As you say, trying to denormalize the two tables into one table so that you could do this as one query might result in a very large and difficult to maintain table. If this is the only query pattern you will have, then that might be worth it, but if this is a general inventory system with a lot of different query patterns, then it might be too inflexible.
The other option would be to use an RDBMS instead of Cassandra, and then joins are super easy.
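For the client-side join, a minimal Python sketch might look like the following. It assumes the zipcode query returns `(rank, store_id)` pairs already ordered by rank and that the SKU query has been loaded into a dict of `store_id -> inventory`; the function name and both input shapes are assumptions, not driver output.

```python
def join_top_stores(ranked_stores, inventory_by_store, limit=100):
    """Walk the ranked stores for a zipcode and keep those that hold the SKU.

    ranked_stores: iterable of (rank, store_id), ordered by rank.
    inventory_by_store: dict of store_id -> inventory for one SKU.
    """
    result = []
    for rank, store_id in ranked_stores:
        quantity = inventory_by_store.get(store_id, 0)
        if quantity > 0:
            result.append((rank, store_id, quantity))
            if len(result) == limit:
                break
    return result
```

Here the join is a dictionary lookup per ranked store, so its cost is driven by how many stores the zipcode has rather than by the number of SKUs.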

Cassandra CQL - clustering order with multiple clustering columns

I have a column family with primary key definition like this:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
which will be queried using queries such as:
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id = 0 AND date > 'some_date' ;
However, I'd like to keep my column family ordered by date only, such as SELECT date FROM myCF ; would return the most recent inserted date.
Due to the order of clustering columns, what I get is an order per user_id then per date.
If I change the primary key definition to:
PRIMARY KEY ((website_id, item_id), date, user_id)
I can no longer run the same query, as date must be restricted if user_id is.
I thought there might be some way to say:
...
PRIMARY KEY ((website_id, item_id), user_id, date)
) WITH CLUSTERING ORDER BY (user_id RANDOMPLEASE, date DESC) ;
But it doesn't seem to exist. Worse, maybe this is completely stupid and I don't get why.
Is there any way of achieving this? Am I missing something?
Many thanks!
Your query example restricts user_id so that should work with the second table format. But if you are actually trying to run queries like
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND date > 'some_date'
Then you need an additional table created to handle those queries; it would be ordered only by date and not by user_id:
Create Table LookupByDate ... PRIMARY KEY ((website_id, item_id), date)
In addition to your primary query, if all you try to get is "return the most recent inserted date", you may not need an additional table. You can use "static column" to store the last update time per partition. CASSANDRA-6561
It probably won't help your particular case (since I imagine your list of all users is unmanageably large), but if the condition on the first clustering column matches one of a relatively small set of values, then you can use IN.
SELECT * FROM myCF
WHERE website_id = 30 AND item_id = 10
AND user_id IN ? AND date > 'some_date'
Don't use IN on the partition key because this will create an inefficient query that hits multiple nodes putting stress on the coordinator node. Instead, execute multiple asynchronous queries in parallel. But IN on a clustering column is absolutely fine.
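The "multiple asynchronous queries in parallel" advice can be sketched with Python's asyncio. The `fetch_partition` coroutine below is only a stand-in for a single-partition query (a real client would call its driver there), and both function names are hypothetical.

```python
import asyncio


async def fetch_partition(partition_key):
    """Stand-in for one single-partition query."""
    await asyncio.sleep(0)  # simulate waiting on the network
    website_id, item_id = partition_key
    return [{"website_id": website_id, "item_id": item_id}]


async def fetch_many(partition_keys):
    """Issue one query per partition key and wait for all of them together."""
    per_partition = await asyncio.gather(
        *(fetch_partition(key) for key in partition_keys)
    )
    # gather() preserves input order, so results line up with the keys.
    return [row for rows in per_partition for row in rows]


rows = asyncio.run(fetch_many([(30, 10), (30, 11), (31, 10)]))
```

Each query is addressed to a single partition, so no single coordinator has to fan out and collect results for all keys the way it would for an IN on the partition key.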

Cassandra BETWEEN & ORDER BY operations

I want to perform SQL operations such as BETWEEN and ORDER BY with ASC/DESC order on Cassandra-0.7.8.
As far as I know, Cassandra-0.7.8 does not directly support these operations. Kindly let me know whether there is a way to accomplish them by tweaking the secondary index.
Below is my Data model design.
Emp(KS){
User(CF):{
bsanderson(RowKey): { eno, name, dept, dob, email }
prothfuss(RowKey): { eno, name, dept, dob, email }
}
}
Queries:
- Select * from emp where dept='IT' ORDER BY dob ASC.
- Select * from emp where eno BETWEEN ? AND ? ORDER BY dob ASC.
Thanks in advance.
Regards,
Thamizhananl
Select * from emp where dept='IT' ORDER BY dob ASC.
You can select rows where the 'dept' column has a certain value, by using the built-in secondary indexes. However, the rows will be returned in the order determined by the partitioner (RandomPartitioner or OrderPreservingPartitioner). To order by arbitrary values such as DOB, you would need to sort at the client.
Or, you could support this query directly by having a row for each dept, and a column for each employee, keyed (and therefore sorted) by DOB. But be careful of shared birthdays! And you'd still need subsequent queries to retrieve other data (the results of your SELECT *) for the employees selected, unless you denormalise so that the desired data is stored in the index too.
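A rough Python model of the row-per-dept idea is shown below. To sidestep the shared-birthday problem it keys each entry by DOB plus employee number, which is one possible workaround assumed here rather than something the data model above prescribes; `dept_index` and the helper names are invented for illustration.

```python
import bisect
from collections import defaultdict
from datetime import date

# dept -> sorted list of (dob, eno, name); plays the role of one row per dept
dept_index = defaultdict(list)


def add_employee(dept, dob, eno, name):
    # Keying on (dob, eno) keeps two employees with the same DOB distinct;
    # keying on dob alone would let the second overwrite the first.
    bisect.insort(dept_index[dept], (dob, eno, name))


def employees_by_dob(dept):
    """Entries come back already ordered by DOB, then employee number."""
    return list(dept_index.get(dept, []))


add_employee("IT", date(1975, 12, 19), 2, "bsanderson")
add_employee("IT", date(1973, 6, 6), 1, "prothfuss")
add_employee("IT", date(1975, 12, 19), 3, "another_user")
```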
Select * from emp where eno BETWEEN ? AND ? ORDER BY dob ASC.
The secondary index querying in Cassandra requires at least one equality term, so I think you can do dept='IT' AND eno >=X AND eno <=y, but not just a BETWEEN-style query.
You could do this by creating your own index row, with a column for each employee, keyed on the employee number, with an appropriate comparator so all the columns are automatically sorted in employee-number order. You could then do a range query on that row to get a list of matching employees - but you would need further queries to retrieve other data for each employee (dob etc), unless you denormalise so that the desired data is stored in the index too. You would still need to do the dob ordering at the client.
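The index-row approach for the BETWEEN query can be sketched in Python as a sorted list searched with bisect, followed by the client-side sort on dob. The tuple layout `(eno, dob, name)` assumes the denormalized variant where the needed data is stored in the index; the function names are hypothetical.

```python
import bisect


def eno_between(index_row, low, high):
    """Slice of an index row sorted by eno, for low <= eno <= high.

    index_row: list of (eno, dob, name) tuples sorted by eno.
    """
    enos = [eno for eno, _, _ in index_row]
    start = bisect.bisect_left(enos, low)    # first eno >= low
    end = bisect.bisect_right(enos, high)    # first eno > high
    return index_row[start:end]


def between_ordered_by_dob(index_row, low, high):
    """Range on eno from the index, then ORDER BY dob on the client."""
    return sorted(eno_between(index_row, low, high), key=lambda entry: entry[1])
```

ISO-formatted date strings such as "1975-12-19" sort chronologically, so they work as the dob value here without further parsing.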
As far as I know, columns are sorted by the comparator you choose when you create the column family, so you can use the column (clustering) key to get the sort order you want, while the rows in a column family are ordered by the partitioner.
I suggest you read Cassandra: The Definitive Guide, Chapter 6.

Resources