How to calculate median in Crate

I am using Crate 0.54.7 and have the following table definition:
CREATE TABLE test (id int PRIMARY KEY, val int);
Now I want to get the median of val. The query I used in Postgresql so far did not work:
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY val) FROM test;
Is there any way I can calculate the median in Crate?

This is not supported in Crate (as of 0.54.x), but there are two open feature requests on GitHub:
1. Percentile Aggregation?
2. Quantiles Aggregation
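Until one of those lands, a workaround (assuming the result set fits in memory) is to pull the values back and compute the median client-side; a minimal Python sketch, with illustrative values in place of a real client fetch:

```python
from statistics import median

# Rows as returned by a "SELECT val FROM test" query -- fetching them is left
# to your Crate client of choice; the values below are purely illustrative.
rows = [(3,), (1,), (4,), (1,), (5,)]
vals = [r[0] for r in rows]
print(median(vals))  # 3
```

`statistics.median` sorts the values and averages the two middle elements when the count is even, matching `percentile_cont(0.5)` semantics.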

Related

Autocalculation of a field value in Cassandra

There is a Cassandra table as follows.
Student(eName text, eId int PRIMARY KEY, m1 int, m2 int, average float)
How can the average value be autocalculated on inserting values for other fields of the row?
That is, we need to insert only eName, eId, m1 and m2; average has to be autocalculated and entered in the tuple.
Thanks.
Cassandra has built-in CQL aggregate functions such as AVG(), which computes the average of all values returned by a query; that is, the aggregation takes place at read time (as opposed to write time).
You can also write your own user-defined aggregates (UDAs).
It is possible to implement your own CQL TRIGGER which executes Java code when data is written to a table, but triggers have been considered experimental for a long time and I don't recommend using them.
The general recommendation is to perform the aggregation within your application prior to writing the data to the table. Cheers!
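Following that recommendation, the write-time computation can be done in the application before the INSERT; a minimal sketch (the function name and dict shape are my own, not part of the answer):

```python
def student_row(e_name, e_id, m1, m2):
    """Build the full Student row, computing `average` before the write
    (write-time aggregation done in the application, not in Cassandra)."""
    return {
        "eName": e_name,
        "eId": e_id,
        "m1": m1,
        "m2": m2,
        "average": (m1 + m2) / 2.0,
    }

row = student_row("Alice", 1, 80, 90)
print(row["average"])  # 85.0
```

The resulting dict can then be bound to a prepared INSERT statement with all five columns.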

Better way to define UDT's in Cassandra database

We are trying to remove 2 columns in a table with 3 types and model them as a UDT instead of having those 2 as columns. So we came up with the two options below. I just wanted to understand whether there is any difference between these two UDT approaches in Cassandra?
First option is:
CREATE TYPE test_type (
cid int,
type text,
hid int
);
and then using like this in a table definition
test_types set<frozen<test_type>>,
vs
Second option is:
CREATE TYPE test_type (
type text,
hid int
);
and then using like this in a table definition
test_types map<int, frozen<test_type>>
So I am just curious which one is preferred here performance-wise, or are they both the same in general?
It really depends on how you will use it. In the first solution you won't be able to select an element by cid, because to access a set element you'd need to specify the full UDT value, with all fields.
The better solution would be following, assuming that you have only one collection column:
CREATE TYPE test_type (
type text,
hid int
);
create table test (
pk int,
cid int,
udt frozen<test_type>,
primary key(pk, cid)
);
In this case:
you can easily select an individual element by specifying the full primary key. The ability to select individual elements from a map is only coming in Cassandra 4.0; see CASSANDRA-7396. Until then you'll need to read the full map back even if you need only one element, and this limits the practical size of the map
you can even select the range of the values, using the range query
you can get all values by specifying only partition key (pk in this example)
you can select multiple non-consecutive values by doing select * from test where pk = ... and cid in (..., ..., ...);
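For illustration, the access patterns listed above would look like this against the test table (the literal values are placeholders):

```sql
-- all elements for a partition
SELECT * FROM test WHERE pk = 1;
-- a single element, by full primary key
SELECT * FROM test WHERE pk = 1 AND cid = 42;
-- a range of elements on the clustering column
SELECT * FROM test WHERE pk = 1 AND cid >= 10 AND cid <= 20;
-- several non-consecutive elements
SELECT * FROM test WHERE pk = 1 AND cid IN (1, 2, 3);
```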
See the "Check use of collection types" section in the data model checks best practices doc.

Cassandra query with multiple OPTIONAL condition

Is it possible to achieve this kind of query in cassandra efficiently?
Say I have a table something
CREATE TABLE something(
a INT,
b INT,
c INT,
d INT,
e INT,
PRIMARY KEY(a,b,c,d,e)
);
And I want to query this table in following way:
SELECT * FROM something WHERE a=? AND b=? AND e=?
or
SELECT * FROM something WHERE a=? AND c=? AND d=?
or
SELECT * FROM something WHERE a=? AND b=? AND d=?
and so on.
All of the above queries won't work because Cassandra requires a query to specify the clustering columns in order.
I know that normally this kind of scenario would call for materialized views or denormalizing the data into several tables. However, in this case I would need 4*3*2*1 = 24 tables, which is not a viable solution.
Secondary indexes require the ALLOW FILTERING option to be turned on for multi-index queries to work, which seems like a bad idea. Besides, there may be some high-cardinality columns in the something table.
I would like to know if there is any work around to allow such a complicated query to work?
How are you ending up with 24 tables? I did not get that.
If your query has an equality condition on 3 columns, isn't that 10 different queries (5C3)?
Maybe I understood your requirement only partially and you really do need n = 24 queries. But here are my suggestions:
Figure out any columns with low cardinality and create a secondary index to satisfy at least one query.
Things to avoid:
Don't go with 1 base table and 23 materialized views. Keep that ratio down to 1 (base) : 5 or 8 (mviews), so it pays to denormalize on the application side.
You may use a uuid as the primary key in your base table so you can reuse it in the materialized views.
Overall, even if you have 24 queries, try to get down to 4 or 5 base tables and then create 5 or 6 materialized views on each of them to reach your intended 24 or so.
You can use Solr along with Cassandra to get such queries working. If you are using DSE, it is much easier. In a Solr query you can directly write:
SELECT * FROM keyspace.something WHERE solr_query='a:? b:? e:?'
Refer to the link below, which shows all the possible combinations you can use with Solr:
https://docs.datastax.com/en/datastax_enterprise/5.0/datastax_enterprise/srch/queriesCql.html?hl=solr%2Cwhere
Writes are very efficient in C*, and reads by partition key are performant.
Create 2 tables, an index table and a content table:
CREATE TABLE somethingIndex(
a_index text PRIMARY KEY,
a INT
);
CREATE TABLE something(
a INT PRIMARY KEY,
b INT,
c INT,
d INT,
e INT
);
During a write, INSERT all needed combinations of (a,b,c,d,e) by concatenating their values.
With 5 elements taken 3 at a time, that is at most 11 inserts: 10 INSERTs into somethingIndex + 1 INSERT into something.
This will be much more efficient than using Solr or another solution like a materialized view.
Check Solr if you need full-text search; for exact-match search the solution above is efficient.
To read data, first select the "a" value from somethingIndex and then read from the something table.
SELECT a FROM somethingIndex where a_index = ?; // (a+b+e) or (a+c+d) or (a+b+d);
SELECT * FROM something where a = ?;
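The combination-key idea above can be sketched in Python; the separator and the "col=value" key format are my assumptions, not part of the answer (any unambiguous encoding works, as long as reads build the key the same way):

```python
from itertools import combinations

def index_keys(row, cols=("a", "b", "c", "d", "e"), k=3):
    """Build one concatenated a_index key per k-column combination,
    for insertion into the somethingIndex table."""
    return ["|".join(f"{c}={row[c]}" for c in combo)
            for combo in combinations(cols, k)]

row = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
keys = index_keys(row)
print(len(keys))   # 10 combinations -> 10 INSERTs into somethingIndex
print(keys[0])     # a=1|b=2|c=3
```

A read then rebuilds the same key from the query's bound columns, looks up "a" in somethingIndex, and fetches the full row from something.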

Partial key on secondary index

I have a secondary index on a table:
CREATE NULL_FILTERED INDEX RidesByPassenger ON Rides(
passenger_id,
start_time
)
If I run the following query:
SELECT start_time FROM Rides#{FORCE_INDEX=RidesByPassenger}
WHERE passenger_id='someid' AND start_time IS NOT NULL;
Can I be sure the base table won't be accessed for it? In other words, if I query a secondary index using only the first part of the primary key (in this case passenger_id), will it use only the secondary index? Or the base table as well? Also, is there a way to ask Spanner exactly which tables it's accessing when I run a query?
Since this query only uses columns that are covered by the index, it will not join the base table.
You can always inspect the query plan to be sure, e.g. by running the query in PLAN or PROFILE mode (`--query-mode=PLAN`) with the `gcloud spanner databases execute-sql` tool.

Cassandra where clause on simple columns

I'm new to Cassandra and I'm having a difficulty to use a simple select query on a very basic table. For example,
SELECT * FROM cars WHERE date > '2015-10-10';
on this given table:
CREATE TABLE cars (id int PRIMARY KEY, name varchar, type varchar, date varchar);
I'm able to use the = operator but not >, <, >= or <=.
I have read on this subject including this article and this overflow question on the different key types, but it is still unclear to me. In the table above, date is a SIMPLE column, why can't I use the WHERE clause like I would use it in a regular RDBMS?
In Cassandra, you can only restrict columns that are part of the primary key (and range predicates such as > are only allowed on clustering columns). Here date is a regular column, which is why your query doesn't work.
Take a look at this article, which addresses a similar problem; you'll see that Cassandra's data model isn't the same as the relational one.
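One common remedy (a sketch, not the only valid model) is to create a query-specific table with date as a clustering column, using an application-chosen bucket (my assumption here: the year) as the partition key so partitions stay bounded. Note that date should also be stored as a timestamp rather than varchar so the comparison is chronological:

```sql
CREATE TABLE cars_by_date (
    bucket int,        -- e.g. the year, chosen by the application
    date   timestamp,
    id     int,
    name   varchar,
    type   varchar,
    PRIMARY KEY (bucket, date, id)
);

-- range predicates on a clustering column are allowed:
SELECT * FROM cars_by_date WHERE bucket = 2015 AND date > '2015-10-10';
```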