How can I use PARTITION and RANK functionality through SubSonic

How can I write a query or lambda expression using SubSonic with the following functions, which are easily done through SQL Server?
Using PARTITION and RANK in your criteria
Here is the query I want to convert through SubSonic:
SELECT * FROM
(
SELECT H.location_id, L.item_id AS po_item, H.po_no, H.order_date, H.created_by,
RANK() OVER (PARTITION BY H.location_id, L.item_id ORDER BY H.location_id, L.item_id, H.order_date DESC) AS Rank
FROM p21_view_po_hdr H INNER JOIN p21_view_po_line L
ON H.po_no = L.po_no
) tmp

I found an answer in these useful links:
Converting SQL Rank() to LINQ, or alternative
and
http://smehrozalam.wordpress.com/tag/ranking-functions/
In LINQ, a similar result can be achieved by using the let keyword. Here's an example:
from p in PersonOrders
//where conditions or joins with other tables to be included here
group p by p.PersonID into grp
let MaxOrderDatePerPerson = grp.Max(g => g.OrderDate)
from p in grp
where p.OrderDate == MaxOrderDatePerPerson
select p
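If the goal is only the rank-1 (latest) row per (location_id, item_id) group, which is what the LINQ pattern above produces, the same result can also be expressed in plain SQL without RANK() as a grouped self-join. A minimal T-SQL sketch, reusing the table and column names from the question:

SELECT H.location_id, L.item_id AS po_item, H.po_no, H.order_date, H.created_by
FROM p21_view_po_hdr H
INNER JOIN p21_view_po_line L ON H.po_no = L.po_no
INNER JOIN (
    -- latest order date per (location_id, item_id) group
    SELECT H2.location_id, L2.item_id, MAX(H2.order_date) AS max_order_date
    FROM p21_view_po_hdr H2
    INNER JOIN p21_view_po_line L2 ON H2.po_no = L2.po_no
    GROUP BY H2.location_id, L2.item_id
) latest ON latest.location_id = H.location_id
        AND latest.item_id = L.item_id
        AND latest.max_order_date = H.order_date

Note that, like filtering on RANK() = 1, this returns every tied row when several orders share the maximum date.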

Related

Cassandra (Amazon Keyspaces) query error on clustering columns

I am trying to execute a query on clustering columns in Amazon Keyspaces. Since I don't want to use ALLOW FILTERING with my native query, I have created 4-5 clustering columns for better performance.
But while trying to filter on two of the clustering columns with >= and <=, I am getting an error with the message below:
message="Clustering column "start_date" cannot be restricted (preceding column "segment_id" is restricted by a non-EQ relation)"
I have also tried a multi-column query, but I am getting a not-supported error:
message="MultiColumn relation is not yet supported."
Queries for reference:
select * from table_name where shard_id = 568 and division = '10' and customer_id = 568113 and (segment_id, start_date,end_date)>= (-1, '2022-05-16','2017-03-28') and flag = 1;
or
select * from table_name where shard_id = 568 and division = '10' and customer_id = 568113 and segment_id > -1 and start_date >='2022-05-16';
I am assuming that your table has the following primary key:
CREATE TABLE table_name (
...
PRIMARY KEY(shard_id, division, customer_id, segment_id, start_date, end_date)
)
In any case, your CQL query is invalid because you can only apply an inequality operator on the last clustering column in your query. For example, these are valid queries based on your table schema:
SELECT * FROM table_name
WHERE shard_id = ? AND division = ?
AND customer_id <= ?
SELECT * FROM table_name
WHERE shard_id = ? AND division = ?
AND customer_id = ? AND segment_id > ?
SELECT * FROM table_name
WHERE shard_id = ? AND division = ?
AND customer_id = ? AND segment_id = ? AND start_date >= ?
All preceding columns must be filtered by an equality operator except for the very last clustering column in your query.
If you require a complex predicate for your queries, you will need to index your Cassandra data with tools such as Elasticsearch or Apache Solr. They will allow you to run complex search parameters to retrieve data from your database. Cheers!
ALLOW FILTERING gets a bad rap sometimes. It all depends on how many rows you end up scanning. It's good to understand how many rows per partition will be scanned and to work backwards from there. Only the last column in the query can carry inequality statements to bound ranges, so try to order your clustering columns to eliminate the most rows first, which reduces the number of rows 'filtered'.
In the example below, we use the index for keys up to start_date and filter on end_date, segment_id, and flag.
select * from table_name where shard_id = 568 and division = '10' and customer_id = 568113 and start_date >= '2022-05-16' and end_date > '2017-03-28' and segment_id > -1 and flag = 1 ALLOW FILTERING;

Error Snowflake - Unsupported subquery type cannot be evaluated

I am facing an error in Snowflake saying "Unsupported subquery type cannot be evaluated" after executing, for example, the statement below. How should I write this statement to avoid this error?
select A
from (
select b
, c
FROM test_table
) ;
The outer query's column list needs to be within the column list of the subquery. Example: select b from (select b, c from test_table);
ignoring "columns" the query you have shown will never trigger this error.
You would get it from this form though:
select A.*
from tableA as A
where a.x = (select b.y FROM test_table as b where b.z = a.z)
This form, assuming there is only one b.y per b.z, can be turned into an inner join like:
select A.*
from tableA as A
join test_table as b
on b.z = a.z and a.x = b.y
Other forms of this pattern do the likes of max(b.y), and those can be made into a sub-select like:
select A.*
from tableA as A
join (
select c.z, max(c.y) as y from test_table as c group by 1
) as b
on b.z = a.z and a.x = b.y
But the general pattern is: in other databases there is no "cost" to row-by-row queries, whereas Snowflake is more optimal at pre-building tables of similar data and then equi-joining those results together. So the "how to write it" examples pivot from for-each-row thinking to building the set of all possible answers and then joining against that. This allows for the most parallel processing of the data possible. And while it means you, the developer, need to understand your data to get the best performance out of it, if you are doing large-scale data processing you should be understanding your data anyway, so this cost is rather acceptable imho.
If you are trying to match two attributes in the subquery, use the patterns below.
If both need to match:
select * from Table WHERE a IN ( select b FROM test_table ) AND a IN ( select c FROM test_table )
If either one needs to match:
select * from Table WHERE a IN ( select b FROM test_table ) OR a IN ( select c FROM test_table )

Using a large lookup table

Problem Statement :
I have two tables - Data (40 cols) and LookUp (2 cols). I need to use col10 in the Data table with the lookup table to extract the relevant value.
However, I cannot do an equi-join. I need a join based on like/contains, as values in the lookup table contain only partial content of the value in the Data table, not the complete value. Hence some regex-based matching is required.
Data size:
Data table: approx. 2.3 billion entries (1 TB of data)
Lookup table: approx. 1.4 million entries (50 MB of data)
Approaches tried:
1. Using the database (I am using Google BigQuery) - a join based on LIKE takes close to 3 hrs, yet it returns no result. I believe a regex-based join leads to a Cartesian join.
2. Using Apache Beam/Spark - I tried to construct a trie for the lookup table, which is then shared/broadcast to the worker nodes. However, with this approach I am getting OOM errors as I am creating too many Strings. I tried increasing memory to 4 GB+ per worker node, but to no avail.
I am using Trie to extract the longest matching prefix.
I am open to using other technologies like Apache Spark, Redis, etc.
Do suggest how I can go about handling this problem.
This processing needs to be performed on a day-to-day basis, hence both time and resources need to be optimized.
However I cannot make equi join
Below is just an idea to explore for addressing your equi-join issue in pure BigQuery.
It is based on an assumption I derived from your comments - it covers the use case where you are looking for the longest match from the very right to the left - matches in the middle are not qualified.
The approach is to reverse both the url (col10) and shortened_url (col2) fields, then SPLIT() them and UNNEST() with positions preserved:
UNNEST(SPLIT(REVERSE(field), '.')) part WITH OFFSET position
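Note that REVERSE() flips the characters inside each part as well, which is harmless because both sides of the join are reversed the same way. A quick standalone illustration using one of the sample urls from the full query below:

#standardSQL
SELECT part, position
FROM UNNEST(SPLIT(REVERSE('cn456.abcd.tech.com'), '.')) part WITH OFFSET position
-- returns: moc (0), hcet (1), dcba (2), 654nc (3)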
With this done, you can now do an equi join, which can potentially address your issue to some extent.
So, you JOIN by parts and positions, then GROUP BY the original url and shortened_url, keeping only those groups HAVING a count of matches equal to the count of parts in shortened_url, and finally you GROUP BY url, keeping only the entry with the highest number of matching parts.
Hope this can help :o)
This is for BigQuery Standard SQL
#standardSQL
WITH data_table AS (
SELECT 'cn456.abcd.tech.com' url UNION ALL
SELECT 'cn457.abc.tech.com' UNION ALL
SELECT 'cn458.ab.com'
), lookup_table AS (
SELECT 'tech.com' shortened_url, 1 val UNION ALL
SELECT 'abcd.tech.com', 2
), data_table_parts AS (
SELECT url, x, y
FROM data_table, UNNEST(SPLIT(REVERSE(url), '.')) x WITH OFFSET y
), lookup_table_parts AS (
SELECT shortened_url, a, b, val,
ARRAY_LENGTH(SPLIT(REVERSE(shortened_url), '.')) len
FROM lookup_table, UNNEST(SPLIT(REVERSE(shortened_url), '.')) a WITH OFFSET b
)
SELECT url,
ARRAY_AGG(STRUCT(shortened_url, val) ORDER BY weight DESC LIMIT 1)[OFFSET(0)].*
FROM (
SELECT url, shortened_url, COUNT(1) weight, ANY_VALUE(val) val
FROM data_table_parts d
JOIN lookup_table_parts l
ON x = a AND y = b
GROUP BY url, shortened_url
HAVING weight = ANY_VALUE(len)
)
GROUP BY url
with the result:
Row  url                  shortened_url  val
1    cn457.abc.tech.com   tech.com       1
2    cn456.abcd.tech.com  abcd.tech.com  2

How do I select all rows for a clustering column in Cassandra?

I have a partition key: A
Clustering columns: B, C
I do understand I can query like this
Select * from table where A = ?
Select * from table where A = ? and B = ?
Select * from table where A = ? and B = ? and C = ?
In certain cases, I want the B value to be any value in that column.
Is there a way I can query like the following?
Select * from table where A = ? and B = 'any value' and C = ?
Option 1:
In Cassandra, you should design your data model to suit your queries. Therefore the proper way to support your fourth query (querying by A and C, without necessarily knowing the B value) is to create a new table to handle that specific query. This table will be pretty much the same, except the clustering columns will be in a slightly different order:
PRIMARY KEY (A, C, B)
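A minimal sketch of that second table; the types and the extra payload column D are assumptions for illustration (D matches the column selected in the materialized view below):

CREATE TABLE table1_by_a_c (
    A int,
    B int,
    C int,
    D int,                  -- assumed payload column, as in the MV below
    PRIMARY KEY (A, C, B)   -- same partition key, but C clusters before B
);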
Now this query will work:
Select * from table where A = ? and C = ?
Option 2:
Alternatively, you can create a materialized view with a different clustering order. Cassandra will then keep the MV in sync with your table data.
create materialized view mv_acbd as
select A, B, C, D
from TABLE1
where A is not null and B is not null and C is not null
primary key (A, C, B);
Now the query against this MV will work like a charm
Select * from mv_acbd where A = ? and C = ?
Option 3:
Not the best option, but you could use the following query with your table as it is:
Select * from table where A = ? and C = ? ALLOW FILTERING
Relying on ALLOW FILTERING is never a good idea, and it is certainly not something that you should do in a production cluster. In this particular case, though, the scan stays within a single partition, and performance will vary depending on how many clustering-column rows per partition your use case has.

Calculating median values in HIVE

I have the following table t1:
key value
1 38.76
1 41.19
1 42.22
2 29.35182
2 28.32192
3 33.66
3 33.47
3 33.35
3 33.47
3 33.11
3 32.98
3 32.5
I want to compute the median for each key group. According to the documentation, the percentile_approx function should work for this. The median values for each group are:
1 41.19
2 28.83
3 33.35
However, the percentile_approx function returns these:
1 39.974999999999994
2 28.32192
3 33.230000000000004
Which clearly are not the median values.
This was the query I ran:
select key, percentile_approx(value, 0.5, 10000) as median
from t1
group by key
It seems not to be taking one value per group into account, resulting in a wrong median. Ordering does not affect the result. Any ideas?
In Hive, the median cannot be calculated directly with the available built-in functions. The query below can be used to find the median:
set hive.exec.parallel=true;
select temp1.key, temp2.value
from
(
  select key, cast(sum(rank)/count(key) as int) as final_rank
  from
  (
    select key, value,
           row_number() over (partition by key order by value) as rank
    from t1
  ) temp
  group by key
) temp1
inner join
(
  select key, value,
         row_number() over (partition by key order by value) as rank
  from t1
) temp2
on temp1.key = temp2.key
and temp1.final_rank = temp2.rank;
The query above computes a row_number for each key by ordering the values within the key, then joins back on the middle row_number of each key, which gives the median value. I have also set the parameter "hive.exec.parallel=true;", which enables the independent tasks to run in parallel.
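Note that for groups with an even number of rows this picks a single middle row rather than averaging the two middle values. A single-scan alternative sketch that averages the two middle rows, assuming the same t1 table:

select key, avg(value) as median
from
(
  select key, value,
         row_number() over (partition by key order by value) as rn,
         count(*) over (partition by key) as cnt
  from t1
) ranked
-- odd counts: one middle row; even counts: the two middle rows are averaged
where rn in (floor((cnt + 1) / 2), ceil((cnt + 1) / 2))
group by key;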
