Cassandra aggregation

I have a Cassandra cluster with 4 tables and data inside.
I want to run queries with aggregation functions (sum, max, ...), but I've read here that it's impossible:
http://www.datastax.com/documentation/cql/3.1/cql/cql_reference/cql_function_r.html
Is there a way to compute sum, average, or group by without buying the enterprise version? Can I use Presto, or another solution?
Thanks

Aggregate functions will be available as part of Cassandra 3.0
https://issues.apache.org/jira/browse/CASSANDRA-4914

Sure it does:
Native aggregates
Count
The count function can be used to count the rows returned by a query.
Example:
SELECT COUNT (*) FROM plays;
SELECT COUNT (1) FROM plays;
It can also be used to count the non-null values of a given column:
SELECT COUNT (scores) FROM plays;
Max and Min
The max and min functions can be used to compute the maximum and the
minimum value returned by a query for a given column. For instance:
SELECT MIN (players), MAX (players) FROM plays WHERE game = 'quake';
Sum
The sum function can be used to sum up all the values returned by a
query for a given column. For instance:
SELECT SUM (players) FROM plays;
Avg
The avg function can be used to compute the average of all the values
returned by a query for a given column. For instance:
SELECT AVG (players) FROM plays;
You can also create your own aggregates; more documentation on aggregates is available here: http://cassandra.apache.org/doc/latest/cql/functions.html?highlight=aggregate
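As a sketch, a user-defined average aggregate might look like the following (this follows the standard documentation pattern; it requires user-defined functions to be enabled in cassandra.yaml, and the names avgState, avgFinal, and myAvg are illustrative):

```sql
-- State function: accumulates (count, sum) over non-null values
CREATE OR REPLACE FUNCTION avgState(state tuple<int, bigint>, val int)
  CALLED ON NULL INPUT
  RETURNS tuple<int, bigint>
  LANGUAGE java
  AS 'if (val != null) {
        state.setInt(0, state.getInt(0) + 1);
        state.setLong(1, state.getLong(1) + val.intValue());
      }
      return state;';

-- Final function: turns (count, sum) into the average
CREATE OR REPLACE FUNCTION avgFinal(state tuple<int, bigint>)
  CALLED ON NULL INPUT
  RETURNS double
  LANGUAGE java
  AS 'if (state.getInt(0) == 0) return null;
      return (double) state.getLong(1) / state.getInt(0);';

CREATE OR REPLACE AGGREGATE myAvg(int)
  SFUNC avgState
  STYPE tuple<int, bigint>
  FINALFUNC avgFinal
  INITCOND (0, 0);

-- SELECT myAvg(players) FROM plays;
```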

Related

Trying to groupby two columns in pandas and return a max value based on criteria

I need some help with a dataset that I am trying to perform a .groupby() on to find the .max() consumption value based on Entity and Year.
The consumption column that I am using to perform the max function, sometimes has the same value for different Years. When this occurs I would like to return the max Year for that occurrence.
df.groupby(['Entity','Year']).consumption.max().reset_index()
returns one row for every (Entity, Year) pair, so the duplicate consumption values across years are still present.
In the end I would like a DataFrame with ['Entity','Year','consumption'] as the columns and when the consumption is the same for a specific Entity Year pair, to return the highest Year of the two.
I worked up a solution
df4 = df.groupby('Entity').consumption.max().reset_index()
df5 = df4.merge(df, left_on=['Entity', 'consumption'] , right_on=['Entity', 'consumption'])
df5 = df5.groupby(['Entity', 'consumption']).max().reset_index()
df5.head(10)
This gives unique countries with their max consumption rates, and if the consumption rate was the same year-over-year, it returns the max year.
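As a self-contained sketch of the approach above (the data here is made up, since the original dataset is only shown as a screenshot):

```python
import pandas as pd

# Hypothetical data standing in for the screenshot in the question
df = pd.DataFrame({
    "Entity":      ["A", "A", "A", "B", "B"],
    "Year":        [2000, 2001, 2002, 2000, 2001],
    "consumption": [5.0, 9.0, 9.0, 3.0, 3.0],
})

# Max consumption per Entity, then join back to recover the Year(s),
# then keep the highest Year among ties
df4 = df.groupby("Entity").consumption.max().reset_index()
df5 = df4.merge(df, on=["Entity", "consumption"])
df5 = df5.groupby(["Entity", "consumption"]).Year.max().reset_index()
print(df5)
```

An alternative one-pass idiom is `df.sort_values(['consumption', 'Year']).groupby('Entity').tail(1)`, which keeps the full row with the max consumption per Entity, breaking ties by the highest Year.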

Performance difference between SELECT sum(column_name) FROM and SELECT column_name in CQL

I'd like to know the performance difference between executing the following two queries against a table cycling.cyclist_points containing thousands of rows:
SELECT sum(race_points)
FROM cycling.cyclist_points
WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac;
SELECT *
FROM cycling.cyclist_points
WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac;
If sum(race_points) causes the query to be expensive, I will have to look for other solutions.
Performance difference between your queries:
Both queries need to scan the same number of rows (the number of rows in that partition).
The first query selects only a single column, so it is a little bit faster.
Instead of calculating the sum at query time, try to precompute it.
If race_points is an int or bigint, you can use a counter table like the one below:
CREATE TABLE race_points_counter (
id uuid PRIMARY KEY,
sum counter
);
Whenever new data is inserted into cyclist_points, also increment the sum with the current points:
UPDATE race_points_counter SET sum = sum + ? WHERE id = ?
Now you can just select the sum for that id:
SELECT sum FROM race_points_counter WHERE id = ?
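The idea, independent of Cassandra, is to maintain the running sum at write time instead of scanning the partition at read time. A minimal Python simulation of that write path (in-memory structures standing in for the two tables; all names are illustrative):

```python
from collections import defaultdict

# In-memory stand-ins for the two tables
cyclist_points = []                     # raw rows, as in cycling.cyclist_points
race_points_counter = defaultdict(int)  # id -> running sum (the counter table)

def insert_points(rider_id, race_points):
    """Write the raw row and bump the precomputed sum in the same step."""
    cyclist_points.append({"id": rider_id, "race_points": race_points})
    # Equivalent of: UPDATE race_points_counter SET sum = sum + ? WHERE id = ?
    race_points_counter[rider_id] += race_points

insert_points("e3b19ec4", 120)
insert_points("e3b19ec4", 80)
insert_points("other", 50)

# Read is now O(1): SELECT sum FROM race_points_counter WHERE id = ?
print(race_points_counter["e3b19ec4"])  # 200
```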

Query to select max value of a column using max using cassandra

We have a table called table1 with Tableindex as a column. We need the maximum value of Tableindex, which should be 3. Consider this table as an example (table1):
Tableindex  Creationdate              Documentname  Documentid
1           2017-01-18 11:59:06+0530  Document 1    Doc_1
2           2017-01-18 12:09:06+0530  Document 2    Doc_2
3           2017-01-18 01:09:06+0530  Document 3    Doc_3
In SQL, we can select the max value using the following query:
Select max(Tableindex) from table1; -- results in 3
Similarly, how can we get the max value in a Cassandra query? Is there a MAX function in Cassandra? If not, how can we get the max value?
Thanks in advance.
In Cassandra 2.2, the standard aggregate functions of min, max, avg, sum, and count are built-in functions.
SELECT MAX(column_name) FROM keyspace.table_name ;
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useQueryStdAggregate.html
If Tableindex is a clustering column sorted in descending order, select * from table1 limit 1 will give you the row with the max Tableindex.
There are better ways to do this with Cassandra. You should know a few things while designing Cassandra tables:
1- Design your table for your queries.
2- Design to distribute data across the Cassandra cluster.
You can also upgrade to a higher Cassandra version (2.2+), which has a built-in max function.
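The "design your table for your queries" point can be sketched in CQL: if the row with the highest Tableindex should be cheap to read, cluster on Tableindex in descending order (the bucket partition key here is illustrative):

```sql
CREATE TABLE table1 (
  bucket text,            -- illustrative partition key, e.g. 'docs'
  Tableindex int,
  Creationdate timestamp,
  Documentname text,
  Documentid text,
  PRIMARY KEY (bucket, Tableindex)
) WITH CLUSTERING ORDER BY (Tableindex DESC);

-- The first row in the partition now has the max Tableindex:
SELECT Tableindex FROM table1 WHERE bucket = 'docs' LIMIT 1;
```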

How to refer to a single value in a Calculated Column in a measure (DAX)

I have a table (Data_all) with a calculated column daycount_ytd.
[Date] is in Date Format.
[Fiscal Year] is just year. eg: 2016
Calculated Column
daycount_ytd=DATEDIFF("01/01/"&[Fiscal Year],Data_all[Date],day)+1
I'm trying to create a measure that refers to this calculated column.
Measure:
Amt_X Yield %:=[Amt X]/([Amt Y]/365* (Data_all[DayCount_YTD]))
I get the error that Data_all[DayCount_YTD] refers to a list of values.
How do I filter the expression to get a single value without using an aggregation function (e.g. SUM, MEDIAN)?
Or perhaps, is there another way to achieve the same calculation?
You've arrived at a fundamental concept in DAX, and once you've worked out how to deal with it, the solution generalises to loads of scenarios.
Basically, you can't just pass columns into a DAX measure without wrapping them in something else, generally some kind of aggregation; alternatively you can use VALUES(), depending on exactly what you are trying to do.
This measure will work OK if you use it in a PIVOT with the date as a row label:
=
SUM ( data_all[Amt X] )
/ (
SUM ( data_all[Amt Y] ) / 365
* MAX ( data_all[daycount_ytd] )
)
However, you will see it gives you an incorrect grand total, because at the total level MAX is evaluated over the entire table rather than per row. What you need is a version that iterates over the rows and then performs a calculation to SUM or AVERAGE each item. There is a whole class of DAX functions dedicated to this, such as SUMX, AVERAGEX, etc. You can read more about them here.
It's not totally clear to me what the maths behind your 'total' should be, but the following measure calculates the value for each day and sums them together:
=
SUMX(
VALUES(data_all[date]),
SUM(data_all[Amt X]) /
(SUM(data_all[Amt Y]) / 365 * MAX(data_all[daycount_ytd]))
)

fetching timeseries/range data in cassandra

I am new to Cassandra and trying to see if it fits my data query needs. I am populating test data in a table and fetching it using the CQL client in Go.
I am storing time series data in Cassandra, sorted by timestamp, on a per-minute basis.
Schema is like this:
parent: string
child: string
bytes: int
val2: int
timestamp: date/time
I need to answer queries where a timestamp range and a child name are provided. The result needs to be the bytes value in that time range (a single value, not a series). I made the primary key (child, timestamp). I followed this approach rather than the column-family, comparator-type approach with timeuuid, since that is not supported in CQL.
Since the data stored at every timestamp (every minute) is an accumulated value, when I get a range query for times t1 to t2, I need to find the bytes value at t2 and the bytes value at t1, and subtract the two values before returning. This works fine if t1 and t2 actually have entries in the table. If they do not, I need to find the times between (t1, t2) that have data and return the difference.
One approach I can think of is to "select * from tablename WHERE timestamp <= t2 AND timestamp >= t1;" and then find the difference between the first and last entries in the array of rows returned. Is this the best way to do it? Since MIN and MAX queries are not supported, is there a way to find the maximum timestamp in the table less than a given value? Thanks for your time.
Are you storing each entry as a new row with a different partition key (the first column in the primary key)? If so, select * from x where f < a and f > b is a cluster-wide query, which will cause you problems. Consider adding a "fake" partition key, or use a partition key per date / week / month etc., so that your queries hit a single partition.
Also, your queries in Cassandra behave as >= and <= even if you specify > and <. If you need strictly greater than or less than, you'll need to filter client side.
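The range-difference part of the question is independent of Cassandra: fetch the rows in [t1, t2] and subtract the first accumulated value from the last. A minimal Python sketch (plain lists standing in for the query result; the data and names are made up):

```python
from datetime import datetime

# Hypothetical per-minute accumulated byte counts for one child,
# already sorted by timestamp as Cassandra would return them
rows = [
    (datetime(2024, 1, 1, 10, 0), 100),
    (datetime(2024, 1, 1, 10, 5), 250),
    (datetime(2024, 1, 1, 10, 9), 400),
]

def bytes_in_range(rows, t1, t2):
    """Difference between the last and first accumulated values in [t1, t2]."""
    in_range = [b for ts, b in rows if t1 <= ts <= t2]
    if len(in_range) < 2:
        return 0  # fewer than two samples: no measurable delta
    return in_range[-1] - in_range[0]

print(bytes_in_range(rows,
                     datetime(2024, 1, 1, 10, 0),
                     datetime(2024, 1, 1, 10, 9)))  # 300
```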
