Invalid identifier in pivot query (Snowflake)

We have data in the following format as the output; the columns are date, cohort name, and rate. Please ignore the remaining columns:
Date        Group_nm  Rate
2019-10-08  A         0.43
2019-10-09  A         0.46
2019-10-08  B         1.5
2019-10-09  B         2
The goal is to pivot it into the following format:
Group_nm  2019-10-08  2019-10-09
A         0.43        0.46
B         1.5         2
Here is my attempt:
SELECT *
FROM (
    SELECT date, group_nm, rate
    FROM CTE1
) AS StudentTable
PIVOT (
    MAX(rate)
    FOR date IN ('2019-10-08', '2019-10-09')
) AS StudentPivotTable;
But I am getting an error: "Invalid identifier rate". Please note that no aggregation is actually needed here; since PIVOT always requires an aggregate, we just used MAX() for the sake of completeness. Help is appreciated.

Try using a CTE instead of that sub-select. I've used a CTE to explicitly load your values, but you could run that SELECT from CTE1 in an actual CTE:
WITH x AS (
    -- load the sample values; replace this with your SELECT from CTE1
    SELECT $1::date AS date_fld, $2 AS group_nm, $3 AS rate
    FROM (VALUES
        ('2019-10-08', 'A', 0.43),
        ('2019-10-09', 'A', 0.46),
        ('2019-10-08', 'B', 1.5),
        ('2019-10-09', 'B', 2)
    )
)
SELECT *
FROM x
PIVOT (MAX(rate) FOR date_fld IN ('2019-10-08'::date, '2019-10-09'::date)) AS p;
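Against the sample rows above, this should return the shape you asked for (the exact pivoted column labels may render slightly differently, since Snowflake names them after the IN-list values):

GROUP_NM  '2019-10-08'  '2019-10-09'
A         0.43          0.46
B         1.5           2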
Also, the documentation examples are very helpful: https://docs.snowflake.net/manuals/sql-reference/constructs/pivot.html

Related

Azure Stream Analytics - Joining Two Streaming Sources

I am trying to join two streaming sources which produce the same data output from Event Hub.
I am trying to find the maximum open price for the stock every 5 minutes and write it to the table. I am interested in the time within the 5-minute window at which the stock was at its maximum, as well as the window time.
I used the query below, but it isn't producing any output.
I think I have messed up the join condition.
WITH Source1 AS (
    SELECT
        System.TimeStamp() AS TimeSlot,
        MAX([open]) AS 'MaxOpenPrice'
    FROM EventHubInputData TIMESTAMP BY TimeSlot
    GROUP BY TumblingWindow(minute, 5)
),
Source2 AS (
    SELECT EventEnqueuedUtcTime, [open]
    FROM EventHubInputDataDup TIMESTAMP BY EventEnqueuedUtcTime
),
Source3 AS (
    SELECT
        Source2.EventEnqueuedUtcTime AS datetime,
        Source1.MaxOpenPrice,
        System.TimeStamp() AS TimeSlot
    FROM Source1
    JOIN Source2
        ON Source2.[open] = Source1.MaxOpenPrice
        AND DATEDIFF(minute, Source1, Source2) BETWEEN 0 AND 5
)
SELECT datetime, MaxOpenPrice, TimeSlot
INTO EventHubOutPutSQLDB
FROM Source3
The logic is good here. First you identify the maximum value in each 5-minute window, then you look up in the original stream the time at which it happened.
WITH MaxOpen5MinTumbling AS (
    SELECT
        --TickerId,
        System.TimeStamp() AS WindowEnd, --this always returns the end of the window when windowing
        MAX([open]) AS 'MaxOpenPrice'
    FROM EventHubInputData --no need to timestamp if using ingestion time
    GROUP BY TumblingWindow(minute, 5)
)
SELECT
    --M.TickerId,
    M.WindowEnd,
    M.MaxOpenPrice,
    O.EventEnqueuedUtcTime AS MaxOpenPriceTime
FROM MaxOpen5MinTumbling M
LEFT JOIN EventHubInputData O
    ON M.MaxOpenPrice = O.[open]
    AND DATEDIFF(minute, M, O) BETWEEN -5 AND 0 --the window timestamp is at the end of the window, so look back 5 minutes
    --AND M.TickerId = O.TickerId
Note that at this point you could get multiple results per time window if the max price happens multiple times.
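A possibly simpler alternative, sketched here and untested against your stream: Stream Analytics has a TopOne() aggregate that returns the whole top event per window, so the enqueue time comes along for free and no self-join is needed. The TopEvent name and the dot-notation field access are assumptions to adapt to your actual payload:
WITH MaxOpenEvent AS (
    SELECT
        System.TimeStamp() AS WindowEnd,
        TopOne() OVER (ORDER BY [open] DESC) AS TopEvent --keeps the full event with the highest price
    FROM EventHubInputData
    GROUP BY TumblingWindow(minute, 5)
)
SELECT
    WindowEnd,
    TopEvent.[open] AS MaxOpenPrice,
    TopEvent.EventEnqueuedUtcTime AS MaxOpenPriceTime
INTO EventHubOutPutSQLDB
FROM MaxOpenEvent
This also sidesteps the duplicate-rows issue, since TopOne() returns exactly one event per window.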

Snowflake unpivoting

I need to transpose a table in which column 1 is the name of an entity and columns 2 through 366 are dates in a year holding a dollar amount. The table, the select statement, and the output are all given below.
Question: this syntax requires me to create a comma-separated list of columns (365 dates) and use that list in the IN clause of the select statement, like this:
.....unpivot (cash for dates in ("1-1-2020" , "1-2-2020" , "1-3-2020"........."12-31-2020")) order by 2
Is there any better way of doing this? Like with regular expressions? I don't want to type 365 dates in mm-dd-yyyy format and get carpal tunnel for my trouble.
Here is the table: the first line is the column header, the second line is a separator, and the 3rd, 4th and 5th lines are sample data.
Name     01-01-2020  01-02-2020  01-03-2020  ...  12-31-2020
-------------------------------------------------------------
Entity1       10.00       15.75       20.00          100.00
Entity2       11.00       16.75       20.00           10.00
Entity3      112.00      166.75       29.00          108.00
I can transpose it using the select statement below
select * from Table1
unpivot (cash for dates in ("1-1-2020" , "1-2-2020" , "1-3-2020")) order by 2
to get an output like the one below -
Name     dates       cash
--------------------------------
Entity1  01-01-2020   10.00
Entity2  01-01-2020   11.00
Entity3  01-01-2020  112.00
...
and so on
There is a simpler way to do this without UNPIVOT. Snowflake gives you a function to represent an entire row as an OBJECT -- a collection of key-value pairs. With that representation, you can FLATTEN the object and extract both the column name (key == date) and the value inside (value == cash). Here is a query that will do it:
with obj as (
    -- represent each row as an OBJECT of column-name / value pairs
    select OBJECT_CONSTRUCT(*) o
    from Table1
)
select o:NAME::varchar as name,
       f.key::date as date,
       f.value::float as cash
from obj,
     lateral flatten (input => obj.o, mode => 'OBJECT') f
where f.key != 'NAME';
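One caveat: f.key::date relies on your session's DATE_INPUT_FORMAT matching the MM-DD-YYYY column headers. If it doesn't, replace the date line with an explicit format (a sketch; adjust the format string to your actual headers):

to_date(f.key::string, 'MM-DD-YYYY') as date,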

Query optimization in Cassandra

I have a Cassandra database that I need to query.
My table looks like this:
Cycle Parameters Value
1 a 999
1 b 999
1 c 999
2 a 999
2 b 999
2 c 999
3 a 999
3 b 999
3 c 999
4 a 999
4 b 999
4 c 999
I need to get the values for parameters "a" and "b" for two cycles, no matter which cycles they are.
Example results:
Cycle Parameters Value
1 a 999
1 b 999
2 a 999
2 b 999
or
Cycle Parameters Value
1 a 999
1 b 999
3 a 999
3 b 999
Since the database is quite huge, every query optimization is welcome.
My requirements are:
I want to do everything in one query
An answer with no nested query would be a plus
So far, I was able to accomplish these requirements with something like this:
select * from table where Parameters in ('a','b') sort by cycle, parameters limit 4
However, this query needs a "sort by" operation that causes heavy processing in the database...
Any clues on how to do it? ...limit by partition, maybe?
EDIT:
The table schema is:
CREATE TABLE cycle_data (
cycle int,
parameters text,
value double,
primary key(parameters,cycle)
)
"parameters" is the partition key and "cycle" is the clustering column
You can't query like this without ALLOW FILTERING. Don't use ALLOW FILTERING in production; only use it for development!
Read the datastax doc about using ALLOW FILTERING https://docs.datastax.com/en/cql/3.3/cql/cql_reference/select_r.html?hl=allow,filter
I assume your current schema is :
CREATE TABLE data (
cycle int,
parameters text,
value double,
primary key(cycle, parameters)
)
And you need another table, or to change your table schema, to query like this:
CREATE TABLE cycle_data (
cycle int,
parameters text,
value double,
primary key(parameters,cycle)
)
Now you can query
SELECT * FROM cycle_data WHERE parameters in ('a','b');
The results will automatically be sorted in ascending order by cycle for each parameter.
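Since you asked about limiting by partition: if you only need two cycles per parameter, Cassandra 3.6 and later also support PER PARTITION LIMIT, which caps the rows returned from each partition (here, each parameter) without any extra sort beyond the clustering order. A sketch against the cycle_data table above:
SELECT * FROM cycle_data WHERE parameters IN ('a','b') PER PARTITION LIMIT 2;
This returns the first two cycles for 'a' and the first two for 'b', matching the example results in the question.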

Calculating median values in HIVE

I have the following table t1:
key value
1 38.76
1 41.19
1 42.22
2 29.35182
2 28.32192
3 33.66
3 33.47
3 33.35
3 33.47
3 33.11
3 32.98
3 32.5
I want to compute the median for each key group. According to the documentation, the percentile_approx function should work for this. The median values for each group are:
1 41.19
2 28.83
3 33.35
However, the percentile_approx function returns these:
1 39.974999999999994
2 28.32192
3 33.230000000000004
Which clearly are not the median values.
This was the query I ran:
select key, percentile_approx(value, 0.5, 10000) as median
from t1
group by key
It seems not to be taking one value per group into account, resulting in a wrong median. Ordering does not affect the result. Any ideas?
In Hive, the median cannot be calculated directly using the available built-in functions. The query below finds the median.
set hive.exec.parallel=true;

select temp1.key, temp2.value
from
(
    select key, cast(sum(rank) / count(key) as int) as final_rank
    from
    (
        select key, value,
               row_number() over (partition by key order by value) as rank
        from t1
    ) temp
    group by key
) temp1
inner join
(
    select key, value,
           row_number() over (partition by key order by value) as rank
    from t1
) temp2
on temp1.key = temp2.key
and temp1.final_rank = temp2.rank;
The above query finds the row_number for each key by ordering the values within the key, then takes the middle row_number of each key, which gives the median value. I have also set "hive.exec.parallel=true", which enables independent tasks to run in parallel.
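As a quick check against the sample data (a sketch of the arithmetic, assuming the join above): key 1 has 3 rows, so final_rank = 6/3 = 2 and the query returns 41.19; key 3 has 7 rows, so final_rank = 28/7 = 4 and it returns 33.35. For key 2, an even-sized group, final_rank = cast(1.5 as int) = 1, so it returns the lower middle value 28.32192 rather than 28.83, the average of the two middle values.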

How to compute median value of sales figures?

In MySQL, while there is an AVG function, there is no MEDIAN. So I need a way to compute the median value for sales figures.
I understand that the median is the middle value. However, it isn't clear to me how you handle a list with an even number of items. How do you determine which value to select as the median, or is further computation needed? Thanks!
I'm a fan of including an explicit ORDER BY statement:
SELECT t1.val AS median_val
FROM (
    SELECT @rownum := @rownum + 1 AS row_number, d.val
    FROM data d, (SELECT @rownum := 0) r
    WHERE 1
    -- put some where clause here
    ORDER BY d.val
) AS t1,
(
    SELECT COUNT(*) AS total_rows
    FROM data d
    WHERE 1
    -- put same where clause here
) AS t2
WHERE 1
AND t1.row_number = FLOOR(total_rows / 2) + 1;
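On the even-count part of your question: assuming you want the usual convention of averaging the two middle values, a variant of the same query does that; for an odd count the two row-number expressions pick the same row, so AVG simply returns it unchanged:
SELECT AVG(t1.val) AS median_val
FROM (
    SELECT @rownum := @rownum + 1 AS row_number, d.val
    FROM data d, (SELECT @rownum := 0) r
    WHERE 1
    -- put some where clause here
    ORDER BY d.val
) AS t1,
(
    SELECT COUNT(*) AS total_rows
    FROM data d
    WHERE 1
    -- put same where clause here
) AS t2
WHERE 1
AND t1.row_number IN (FLOOR((total_rows + 1) / 2), FLOOR((total_rows + 2) / 2));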
