I have a cassandra database that I need to query
My table looks like this:
Cycle Parameters Value
1 a 999
1 b 999
1 c 999
2 a 999
2 b 999
2 c 999
3 a 999
3 b 999
3 c 999
4 a 999
4 b 999
4 c 999
I need to get values for parameters "a" and "b" for two cycles, no matter which cycles they are.
Example results:
Cycle Parameters Value
1 a 999
1 b 999
2 a 999
2 b 999
or
Cycle Parameters Value
1 a 999
1 b 999
3 a 999
3 b 999
Since the database is quite large, every query optimization is welcome.
My requirements are:
I want to do everything in 1 query
An answer with no nested query would be a plus
So far, I was able to accomplish these requirements with something like this:
select * from table where Parameters in ('a','b') sort by cycle, parameters limit 4
However, this query needs the "sort by" operation, which causes heavy processing in the database...
Any clues on how to do it? ....limit by partition maybe?
EDIT:
The table schema is:
CREATE TABLE cycle_data (
cycle int,
parameters text,
value double,
primary key(parameters,cycle)
)
"parameters" is the partition key and "cycle" is the clustering column
You can't run this kind of query without ALLOW FILTERING. Don't use ALLOW FILTERING in production; only use it for development!
Read the DataStax documentation about using ALLOW FILTERING: https://docs.datastax.com/en/cql/3.3/cql/cql_reference/select_r.html?hl=allow,filter
I assume your current schema is:
CREATE TABLE data (
cycle int,
parameters text,
value double,
primary key(cycle, parameters)
)
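With that assumed schema, a filter on parameters alone only runs with ALLOW FILTERING, which is exactly the trap described above; shown for illustration only, not for production use:
SELECT * FROM data WHERE parameters = 'a' ALLOW FILTERING;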
You need another table (or to change your table schema) to run a query like this:
CREATE TABLE cycle_data (
cycle int,
parameters text,
value double,
primary key(parameters,cycle)
)
Now you can query
SELECT * FROM cycle_data WHERE parameters in ('a','b');
The results will automatically be sorted in ascending order by cycle for every parameter.
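And since the question asked about limiting per partition: if the cluster is on Cassandra 3.6 or later, PER PARTITION LIMIT returns exactly the first two cycles for each parameter with no sort step at all; a sketch, assuming that Cassandra version:
SELECT * FROM cycle_data WHERE parameters IN ('a','b') PER PARTITION LIMIT 2;
Rows within each partition are already stored in clustering order by cycle, so this reads just two rows per parameter and nothing more.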
I have string values fetched from a table using listagg(column, ',').
I want to loop over this string list and set each value into the WHERE clause of a query on another table,
and then get a count of the cases where there are no records in that table (the number of times with no record).
I'm writing this inside a PL/SQL procedure.
The first table (orders):
order_id name
10 test1
20 test2
22 test3
25 test4
The second table:
col_id product order_id
1 pro1 10
2 pro2 30
3 pro2 38
Expected result: the count (number of times with no record) in the 2nd table:
count = 3
because there are no records for order ids 20, 22, 25 in the 2nd table;
there is only a record for order_id 10.
My queries:
SELECT listagg(ord.order_id, ',')
into wk_orderids
from orders ord
where ord.id_no = wk_id_no;
loop
-- do my stuff
end loop
wk_orderids now holds: '10,20,22,25'
I want to loop over this (wk_orderids), set the values one by one into a SELECT query's WHERE clause,
and then get the count of the number of times with no record.
If you want to count ORDER_IDs in the 1st table that don't exist in the ORDER_ID column of the 2nd table, then your current approach looks as if you were given a task to do that in the most complicated way. Aggregating values, looping through them, adding values into a WHERE clause (which then requires dynamic SQL)... OK, but - why? Why not simply:
select count(*)
from (select order_id from first_table
minus
select order_id from second_table
);
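Inside the procedure this collapses to a single SELECT ... INTO, with no loop and no dynamic SQL; a minimal sketch, assuming the illustrative names first_table and second_table from above, a wk_missing_count variable declared in the procedure, and the wk_id_no filter from the question:
select count(*)
into wk_missing_count -- illustrative variable declared in the enclosing procedure
from (select order_id from first_table ft
where ft.id_no = wk_id_no -- same filter as in the question's listagg query
minus
select order_id from second_table);
-- wk_missing_count now holds 3 for the sample data (order ids 20, 22, 25 have no record)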
Assume I have a table A with 100 records in it in Teradata. Now I have to pass 20 rows at a time to a specific process, 5 times. I am struggling to segment the whole table of 100 records into 5 subparts; any clue of any SQL which can give me such data?
Example:
table A
A AA
B BB
C CC
D DD
E EE
F FF
Here I have 6 records. I want to fetch the first 2, then the second 2, and then the last 2 records, one batch at a time; any SQL help?
If there's some unique column(s), you can apply ROW_NUMBER:
select *
from table
QUALIFY
ROW_NUMBER() OVER (ORDER BY unique_column(s)) BETWEEN 3 AND 4
;
Of course, this is not very efficient on a big table.
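To avoid hard-coding each BETWEEN range, the row number can also be bucketed into fixed-size chunks; a sketch, again assuming unique_column(s) exists, with 20-row batches for the original 100-row case (Teradata truncates integer division):
select *
from table
QUALIFY
(ROW_NUMBER() OVER (ORDER BY unique_column(s)) - 1) / 20 = 0  -- substitute 0,1,2,3,4 for the five batches
;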
I have a Delta table that contains duplicate keys. For example:
id age
1 22
1 23
1 25
2 22
2 11
When merging a new table that looks like this into the Delta table:
id age
1 23
1 24
1 23
2 21
2 12
Using this function:
def upsertToDelta(microBatchOutputDF):
    (student_table.alias("t").merge(
        microBatchOutputDF.alias("s"),
        "s.id = t.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())
It throws an error:
Cannot perform Merge as multiple source rows matched and attempted to modify the same
I understand why this is happening, but what I'd like to know is how I can remove the old keys and insert the new keys even though the ids are the same. So the resulting table should look like this:
id age
1 23
1 24
1 23
2 21
2 12
Is there a way to do this?
This looks like an SCD Type 1 change, where we overwrite the old data with the new. To handle this, you must have at least one expression that can act as a merge key. The trick is to stage the source so that each existing id appears exactly once with merge_key = id (those rows match and get deleted) and every incoming row appears with a NULL merge_key (those rows never match, so they get inserted):
Before Merge:
The delete branch must be deduplicated by id; otherwise several source rows would try to modify the same target rows, which is exactly the error you hit. The DISTINCT in the merge statement below handles this; a row_number partitioned by the id column would work just as well, and that step is printed below just for understanding.
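For reference, assuming microBatchOutputDF is exposed as a temp view (as the MERGE below also assumes), that row_number step would look something like this:
SELECT id, age,
row_number() OVER (PARTITION BY id ORDER BY age) AS row_num
FROM microBatchOutputDF;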
Merge SQL:
MERGE INTO student_table AS target
USING (
  -- one DELETE row per id that already exists in the target
  -- (DISTINCT avoids "multiple source rows matched" errors)
  SELECT DISTINCT id AS merge_key, id, CAST(NULL AS INT) AS age
  FROM microBatchOutputDF
  WHERE id IN (
    SELECT DISTINCT id
    FROM student_table
  )
  UNION ALL
  -- every incoming row, with a NULL merge_key so it can only fall through to the INSERT
  SELECT CAST(NULL AS INT) AS merge_key, id, age
  FROM microBatchOutputDF
) AS source
ON target.id = source.merge_key
WHEN MATCHED
THEN
DELETE
WHEN NOT MATCHED AND source.merge_key IS NULL
THEN
INSERT (id, age)
VALUES (source.id, source.age)
;
The result:
id age
1 23
1 24
1 23
2 21
2 12
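The same staging can also be written with the Delta Lake Python API inside the foreachBatch function; a minimal sketch, assuming delta-spark and that student_table is the DeltaTable handle from the question:
from pyspark.sql import functions as F

def upsertToDelta(microBatchOutputDF):
    # one DELETE row per distinct incoming id (avoids "multiple source rows matched")
    deletes = (microBatchOutputDF
               .select("id").dropDuplicates()
               .withColumn("merge_key", F.col("id"))
               .withColumn("age", F.lit(None).cast("int")))
    # every incoming row with a NULL merge_key, so it can only be inserted
    inserts = microBatchOutputDF.withColumn("merge_key", F.lit(None).cast("int"))
    source = deletes.select("merge_key", "id", "age").unionByName(
        inserts.select("merge_key", "id", "age"))
    (student_table.alias("t")
     .merge(source.alias("s"), "t.id = s.merge_key")
     .whenMatchedDelete()
     .whenNotMatchedInsert(condition="s.merge_key IS NULL",
                           values={"id": "s.id", "age": "s.age"})
     .execute())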
I have Cassandra version 2.0, and I am totally new to it, so the question...
I have table T1, with columns with names: 1,2,3...14 (for simplicity);
Partitioning key is columns 1, 2;
Clustering key is columns 3, 1, 5;
I need to perform following query:
SELECT 1,2,7 FROM T1 where 2='A';
Column 2 is a flag, so values are repeating.
I get the following error:
Unable to execute CQL query: Partitioning column 2 cannot be restricted because the preceding column 1 is either not restricted or is restricted by a non-EQ relation
So what is the right way to do it? I really need to get the data already filtered. Thanks.
So, to make sure I understand your schema, you have defined a table T1:
CREATE TABLE T1 (
1 INT,
2 INT,
3 INT,
...
14 INT,
PRIMARY KEY ((1, 2), 3, 1, 5)
);
Correct?
If this is the case, then Cassandra cannot find the data to answer your CQL query:
SELECT 1,2,7 FROM T1 where 2 = 'A';
because your query has not provided a value for column "1", without which Cassandra cannot compute the partition key (which, per your composite PRIMARY KEY definition, requires both columns "1" and "2"), and without that it cannot determine which nodes in the ring to look on. By including "2" in your partition key, you are telling Cassandra that that data is required to determine where to store (and thus where to read) the data.
For example, given your schema, this query should work:
SELECT 7 FROM T1 WHERE 1 = 'X' AND 2 = 'A';
since you are providing both values of your partition key.
@Caleb Rockcliffe has good advice, though, regarding the need for other, secondary/supplemental lookup mechanisms if the above table definition is a big part of your workload. You may need to find some way to first look up the values for "1" and "2", and then issue your query. E.g.:
CREATE TABLE T1_index (
1 INT,
2 INT,
PRIMARY KEY (1, 2)
);
Given a value for "1", the above will provide all of the possible "2" values, through which you can then iterate:
SELECT 2 FROM T1_index WHERE 1 = 'X';
And then, for each "1" and "2" combination, you can then issue your query against table T1:
SELECT 7 FROM T1 WHERE 1 = 'X' AND 2 = 'A';
Hope this helps!
Your WHERE clause needs to include every column of the partition key.
I have the following table t1:
key value
1 38.76
1 41.19
1 42.22
2 29.35182
2 28.32192
3 33.66
3 33.47
3 33.35
3 33.47
3 33.11
3 32.98
3 32.5
I want to compute the median for each key group. According to the documentation, the percentile_approx function should work for this. The median values for each group are:
1 41.19
2 28.83
3 33.35
However, the percentile_approx function returns these:
1 39.974999999999994
2 28.32192
3 33.230000000000004
Which clearly are not the median values.
This was the query I ran:
select key, percentile_approx(value, 0.5, 10000) as median
from t1
group by key
It seems not to be taking one value per group into account, resulting in a wrong median. Ordering does not affect the result. Any ideas?
In Hive, the median cannot be calculated directly using the available built-in functions. The query below can be used to find the median.
set hive.exec.parallel=true;

select temp1.key, temp2.value
from
(
  select key, cast(sum(rank)/count(key) as int) as final_rank
  from
  (
    select key, value,
           row_number() over (partition by key order by value) as rank
    from t1
  ) temp
  group by key
) temp1
inner join
(
  select key, value,
         row_number() over (partition by key order by value) as rank
  from t1
) temp2
on temp1.key = temp2.key
and temp1.final_rank = temp2.rank;
The above query finds the row_number for each key by ordering the values within each key. It then takes the middle row_number of each key, which gives the median value. For key 3, for example, there are 7 rows, so sum(rank)/count(key) = 28/7 = 4, and the 4th value in sorted order is 33.35. Note that for groups with an even number of rows this picks a single middle row rather than averaging the two middle values (key 2 yields 28.32192 instead of 28.83). I have also added the parameter "hive.exec.parallel=true;", which enables independent tasks to run in parallel.
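If the even-sized groups matter, a variant that averages the two middle rows gives the exact median everywhere; a sketch, assuming Hive 0.11+ window functions (floor/ceil pick the same row for odd counts and the two middle rows for even counts):
select key, avg(value) as median
from
(
  select key, value,
         row_number() over (partition by key order by value) as rn,
         count(*) over (partition by key) as cnt
  from t1
) t
where rn in (floor((cnt + 1) / 2), ceil((cnt + 1) / 2))
group by key;
For the sample data this returns 41.19, 28.83687 (the 28.83 above, before rounding), and 33.35.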