Query to select the max value of a column in Cassandra - cassandra

We have a table called table1 with a column Tableindex. We need the maximum value of Tableindex, which should be 3 for the example table (table1) below:
Tableindex  Creationdate              Documentname  Documentid
1           2017-01-18 11:59:06+0530  Document 1    Doc_1
2           2017-01-18 12:09:06+0530  Document 2    Doc_2
3           2017-01-18 01:09:06+0530  Document 3    Doc_3
In SQL, we can select the max value with the following query:
SELECT max(Tableindex) FROM table1; -- returns 3
How can we get the max value in the same way with a Cassandra query? Is there a max function in Cassandra? If not, how can we get the max value?
Thanks in advance.

As of Cassandra 2.2, the standard aggregate functions min, max, avg, sum, and count are built in:
SELECT MAX(column_name) FROM keyspace.table_name;
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useQueryStdAggregate.html

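If Tableindex is a clustering column stored in descending order (CLUSTERING ORDER BY (Tableindex DESC)), a select * with limit 1 restricted to a single partition returns the row with the max Tableindex for that partition. Without that ordering, limit 1 gives no such guarantee.
There are better ways to do this with Cassandra. You should know a few things when designing Cassandra tables:
1- Design your tables for your queries.
2- Design them to distribute data evenly across the Cassandra cluster.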

You can upgrade to a newer Cassandra version (2.2+), which has a built-in max function.

Related

Performance difference between SELECT sum(column_name) FROM and SELECT column_name in CQL

I would like to know the performance difference between the following two queries on a table cycling.cyclist_points containing thousands of rows:
SELECT sum(race_points)
FROM cycling.cyclist_points
WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac;
select *
from cycling.cyclist_points
WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac;
If sum(race_points) makes the query expensive, I will have to look for other solutions.
Performance difference between your queries:
Both queries need to scan the same number of rows (the number of rows in that partition).
The first query reads only a single column and returns one value, so it is a little faster.
Instead of calculating the sum at run time, try to precompute it.
If race_points is an int or bigint, use a counter table like the one below:
CREATE TABLE race_points_counter (
id uuid PRIMARY KEY,
sum counter
);
Whenever new data is inserted into cyclist_points, also increment the sum by the new points:
UPDATE race_points_counter SET sum = sum + ? WHERE id = ?
Now you can simply select the sum for that id:
SELECT sum FROM race_points_counter WHERE id = ?
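To illustrate why the counter approach gives the same answer as summing at read time, here is a minimal in-memory sketch in Python. It does not talk to Cassandra; the class PointsStore and its method names are hypothetical stand-ins for the cyclist_points and race_points_counter tables.

```python
from collections import defaultdict


class PointsStore:
    """In-memory stand-in (hypothetical) for cyclist_points and race_points_counter."""

    def __init__(self):
        # cyclist id -> list of race_points values (plays the role of cyclist_points)
        self.rows = defaultdict(list)
        # cyclist id -> running sum (plays the role of race_points_counter)
        self.counter = defaultdict(int)

    def insert_points(self, cyclist_id, race_points):
        # Every insert also increments the counter, mirroring
        # UPDATE race_points_counter SET sum = sum + ? WHERE id = ?
        self.rows[cyclist_id].append(race_points)
        self.counter[cyclist_id] += race_points

    def sum_by_scan(self, cyclist_id):
        # Read-time aggregation: touches every row of the partition.
        return sum(self.rows.get(cyclist_id, []))

    def sum_by_counter(self, cyclist_id):
        # Precomputed aggregation: a single lookup.
        return self.counter.get(cyclist_id, 0)
```

The trade-off in this sketch is that every write does a little extra work so that reads no longer depend on the number of rows in the partition.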

Cassandra slow SELECT MAX(x) query

I have a dev machine with Cassandra 3.9 and 2 tables; one has about 400,000 records, the other about 40,000,000 records. Their structures are different.
Each of them has a secondary index on a field x, and I'm trying to run a query of the form SELECT MAX(x) FROM table. On the first table the query takes a couple of seconds, and on the second table it times out.
My experience is with relational databases, where these queries are trivial and fast. So in Cassandra, it looks like the index isn't used to execute these queries? Is there an alternative?
In Cassandra, running aggregate functions such as MIN, MAX, COUNT, SUM or AVG on a table without specifying a partition key is bad practice. Instead, you can have another table that stores the max value of the x field for both tables.
However, you have to add some client-side logic to maintain this max value in the other table whenever you run INSERT or UPDATE statements.
Table structures:
CREATE TABLE t1 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE t2 (
pk text PRIMARY KEY,
x int
);
CREATE TABLE agg_table (
table_name text PRIMARY KEY,
max_value int
);
With this structure you can read the max value for a table:
SELECT max_value
FROM agg_table
WHERE table_name = 't1';
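The client-side logic mentioned above could look roughly like the following Python sketch. It uses a plain dict in place of agg_table, and the function name record_value is hypothetical. It only handles new or increased values; deletes or updates that lower x are not covered, and with concurrent writers a real client would likely need a conditional update rather than this read-then-write.

```python
def record_value(agg_table, table_name, x):
    """Keep agg_table[table_name] equal to the largest x seen so far.

    agg_table is a dict standing in for the agg_table CQL table
    (table_name -> max_value). Call this alongside every INSERT or UPDATE.
    """
    current = agg_table.get(table_name)
    if current is None or x > current:
        agg_table[table_name] = x
    return agg_table[table_name]
```

Reading the max then becomes a single-key lookup, which corresponds to the SELECT on agg_table shown above.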
Hope this can help you.

Cassandra Range Query

I have a set of products (product_Id, price).
The price of every product keeps changing and hence needs to be updated very frequently.
I want to perform a range query on prices:
select * from products where price > 10 and price < 100;
Please note: I just want to get the products in the range. The exact query does not matter.
What is the best way to model this scenario? I'm using Cassandra 2.1.9.
If price is a clustering column, you can only run range queries on it together with the partition key. E.g.
Your table:
CREATE TABLE products (productId text, price float, PRIMARY KEY (productId, price));
Your range query:
SELECT * FROM products
WHERE productId = 'ysdf834234' AND price < 1000 AND price > 30;
But I think this query is of little use on its own. If you need price ranges without restricting the partition key, you need a new table. I also think a Cassandra table with only 2 columns is poor database design; for your use case a pure key-value store (like Redis) is a better option. However, if you also add columns such as productType, productVariation, productColor, productBrand, ..., Cassandra becomes a good option. Then you can create tables like:
productsByType_price PRIMARY KEY(productType, productPrice, productId)
productsByType_color PRIMARY KEY(productType, productColor, productId)
productsByType_brand PRIMARY KEY(productType, productBrand, productId)
etc.
One tip: Read a bit more about how cassandra manages the data. This really helps you with your data modelling.
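As a rough illustration of how a table like productsByType_price behaves, here is an in-memory Python sketch. It is not Cassandra code; the class and method names are hypothetical. Each partition (one per productType) keeps its rows sorted by (price, productId), so a price range is a contiguous slice. Because price is part of the primary key in this layout, a price change is modelled as a delete plus an insert, which matters when prices change very frequently.

```python
import bisect
from collections import defaultdict


class ProductsByTypePrice:
    """In-memory sketch of PRIMARY KEY (productType, productPrice, productId)."""

    def __init__(self):
        # productType -> list of (price, productId), kept sorted like clustering columns
        self.partitions = defaultdict(list)

    def insert(self, product_type, price, product_id):
        bisect.insort(self.partitions[product_type], (price, product_id))

    def change_price(self, product_type, old_price, new_price, product_id):
        # A key column cannot be updated in place: remove the old row, add a new one.
        self.partitions[product_type].remove((old_price, product_id))
        self.insert(product_type, new_price, product_id)

    def price_range(self, product_type, low, high):
        # Rows with low < price < high inside one partition.
        rows = self.partitions.get(product_type, [])
        prices = [price for price, _ in rows]
        start = bisect.bisect_right(prices, low)
        end = bisect.bisect_left(prices, high)
        return rows[start:end]
```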

How to perform a query with multiple greater/less than condition in CQL/cassandra?

I am writing a map app.
Once I have the bounds of the map, I need to query the DB for points.
The structure:
CREATE TABLE map.items (
id timeuuid,
type int,
lat double,
lng double,
rank int
)
and I would like to query like this:
select * from map.items
where type=1
and (lat > 30 and lat < 35)
and (lng > 100 and lng < 110)
order by rank desc
limit 10;
I'm new to Cassandra; please help or provide references.
Thank you!
CQL can only apply a range query on one clustering key, so you won't be able to do that directly in CQL.
A couple of possible approaches:
Pair Cassandra with Apache Spark. Read the table into an RDD and apply a filter operation to keep only rows that match the ranges you are looking for. Then do a sort operation by rank. Then use collect() to gather and output the top ten results.
Or in pure Cassandra, partition your data by latitude. For example one partition might contain data points for all latitudes >= 30 and < 31, and so on. Your client would then do multiple queries of each latitude partition needed (i.e. 30, 31, 32, 33, and 34) and use a range query on longitude (using longitude as a clustering column). As results are returned, keep the highest ten ranked rows and discard the others.
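The pure-Cassandra approach can be sketched in Python as follows. This is an in-memory simulation, not driver code: the partitions dict stands in for a table partitioned by (type, latitude bucket), and all function names are hypothetical. One-degree buckets are assumed, as in the example above.

```python
import heapq
import math
from collections import defaultdict


def add_item(partitions, item):
    # Partition key: (type, whole-degree latitude bucket).
    key = (item["type"], math.floor(item["lat"]))
    partitions[key].append(item)


def buckets_for_range(lat_min, lat_max):
    # Buckets that can contain latitudes in the open interval (lat_min, lat_max).
    return range(math.floor(lat_min), math.ceil(lat_max))


def top_ranked(partitions, item_type, lat_min, lat_max, lng_min, lng_max, limit=10):
    matches = []
    for bucket in buckets_for_range(lat_min, lat_max):
        # One "query" per latitude partition, with a range filter on longitude.
        for item in partitions.get((item_type, bucket), []):
            in_lng = lng_min < item["lng"] < lng_max
            # Edge buckets may hold points outside the exact latitude bounds.
            in_lat = lat_min < item["lat"] < lat_max
            if in_lng and in_lat:
                matches.append(item)
    # Keep only the highest-ranked rows and discard the others.
    return heapq.nlargest(limit, matches, key=lambda it: it["rank"])


def new_partitions():
    return defaultdict(list)
```

For the query in the question, top_ranked(partitions, 1, 30, 35, 100, 110) would visit buckets 30 through 34 and return at most ten items ordered by rank, highest first.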

Cassandra aggregation

I have a Cassandra cluster with 4 tables that already contain data.
I want to run queries with aggregate functions (sum, max, ...), but I've read here that it's impossible:
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/cql_function_r.html
Is there a way to do sum, average, or group by without buying the enterprise version? Can I use Presto or other solutions?
Thanks
Aggregate functions are available as of Cassandra 2.2:
https://issues.apache.org/jira/browse/CASSANDRA-4914
Sure it does:
Native aggregates
Count
The count function can be used to count the rows returned by a query.
Example:
SELECT COUNT (*) FROM plays;
SELECT COUNT (1) FROM plays;
It can also be used to count the non-null values of a given column:
SELECT COUNT (scores) FROM plays;
Max and Min
The max and min functions can be used to compute the maximum and the
minimum value returned by a query for a given column. For instance:
SELECT MIN (players), MAX (players) FROM plays WHERE game = 'quake';
Sum
The sum function can be used to sum up all the values returned by a
query for a given column. For instance:
SELECT SUM (players) FROM plays;
Avg
The avg function can be used to compute the average of all the values
returned by a query for a given column. For instance:
SELECT AVG (players) FROM plays;
You can also create your own aggregates, more documentation on aggregates here: http://cassandra.apache.org/doc/latest/cql/functions.html?highlight=aggregate

Resources