SNOWFLAKE - PIVOT QUERY

Table 1:
ID AUDIT_KEY Col_Name
1 100 FULL NAME
2 101 FNAME
3 102 LNAME
4 103 ADDR1
Table 2:
ID_key AUDITKEY Col_Value
1 100 john abraham
2 101 johny
3 102 Abraham
4 103 6900 Forester Rd
1 104 Praveen Gupta
2 105 Praveen
3 106 Gupta
4 107 3200 Walter RD
I'm looking for a query to form the result below:
ID NAME FNAME LNAME ADDR1
1 JOHN ABRAHAM JOHNY ABRAHAM 6900 Forester Rd
2 PRAVEEN GUPTA PRAVEEN GUPTA 3200 WALTER RD
I've written a pivot query on Table1, but when I join Table2 based on ID and ID_key it is not working. Any ideas, folks?
SELECT ID, NAME,FNAME,LNAME,ADDR1
FROM TABLE1
INNER JOIN TABLE2 ON TABLE1.ID = TABLE2.ID
PIVOT (MAX(TABLE1.COL_NAME)
FOR TABLE2.COL_VALUE IN ('ID','NAME','FNAME','LNAME','ADDR1')) AS TMP

I would suggest doing the join in a CTE and then running the pivot against the result of that, something along the lines of:
WITH x AS (
    SELECT t2.id_key, t1.col_name, t2.col_value
    FROM TABLE1 t1
    INNER JOIN TABLE2 t2
        ON t1.id = t2.id_key
)
SELECT *
FROM x
PIVOT (MAX(col_value)
       FOR col_name IN ('FULL NAME', 'FNAME', 'LNAME', 'ADDR1')) AS tmp;
Note - I haven't tested this myself, so see if it works. I don't think Snowflake likes a join and a pivot in the same query, so doing the join inside a CTE gives the pivot a single-table input.
Also note that your aggregate and pivot column were swapped: the labels ('FULL NAME', 'FNAME', ...) live in col_name and the data lives in col_value, so it should be MAX(col_value) FOR col_name IN (...). You also had 'ID' in the IN clause, but that value doesn't appear in your sample data, so you don't want it there. That could also be your issue.


How to join a spark dataframe twice with different id type

I have a Spark DataFrame called events that I want to join with another Spark DataFrame called users. A user can be identified in the events DataFrame by two different types of Id.
The schemas of the DataFrames can be seen below:
Events:
Id          IdType   Name      Date        EventType
324         UserId   Daniel    2022-01-15  purchase
350         UserId   Jack      2022-01-16  purchase
3247623322  UserCel  Michelle  2022-01-10  claim
Users:
Id   Name      Cel
324  Daniel    5511737379
350  Jack      3247623817
380  Michelle  3247623322
What I want to do is left join the events DataFrame twice, in order to extract all the events regardless of the IdType used in the events DataFrame.
The final dataframe I want must be as follows:
Id   Name      Cel         Date        EventType
324  Daniel    5511737379  2022-01-15  Purchase
350  Jack      3247623817  2022-01-16  Purchase
380  Michelle  3247623322  2022-01-10  Claim
I guess the Python (PySpark) code for this join might be close to:
(users.join(events, on = [users.Id == events.Id], how = 'left')
.join(events, on = [users.Cel == events.Id], how = 'left'))
You can do that with the following code: a plain inner join picks up the users whose Id matched directly, a leftanti join finds the users whose Id did not, and those are then matched on Cel instead:
# users whose events carry their Id directly
with_id = (users.join(events, on=users["Id"] == events["Id"], how='inner')
                .select(events["Id"], events["Name"], "Cel", "Date", "EventType"))
# users with no direct Id match: match their Cel against events.Id instead
incorrect_id = (users.join(events, on=users["Id"] == events["Id"], how='leftanti')
                     .join(events, on=users["Cel"] == events["Id"])
                     .select(users["Id"], events["Name"], "Cel", "Date", "EventType"))
result = with_id.unionAll(incorrect_id)
The result:
result.show()
+---+--------+----------+----------+---------+
| Id| Name| Cel| Date|EventType|
+---+--------+----------+----------+---------+
|324| Daniel|5511737379|2022-01-15| purchase|
|350| Jack|3247623817|2022-01-16| purchase|
|380|Michelle|3247623322|2022-01-10| claim|
+---+--------+----------+----------+---------+
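An alternative (a sketch I haven't run, assuming IdType only ever takes the values 'UserId' and 'UserCel') is a single left join whose condition switches on IdType:
# Match events to users either by user id or by cel number,
# depending on what kind of id the event row carries.
cond = (
    ((events["IdType"] == "UserId") & (events["Id"] == users["Id"]))
    | ((events["IdType"] == "UserCel") & (events["Id"] == users["Cel"]))
)
result = (users.join(events, cond, "left")
               .select(users["Id"], users["Name"], users["Cel"],
                       "Date", "EventType"))
result.show()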

filtering rows in one dataframe based on two columns of another dataframe

I have two data frames. One dataframe (dfA) looks like:
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (dfB) looks like
Name position string
Peter 89 aa
Jennie 568 bb
Jennie 90 cc
I want to filter data from dfA such that position from dfB falls in the interval given by dfA (start_coordinate and end_coordinate) and the names match as well. For example, the position value of row 1 of dfB falls in the interval specified by row 1 of dfA and the corresponding Name value is the same, therefore I want this row. In contrast, row 3 of dfB also falls in the interval of row 1 of dfA, but the Name value is different, therefore I don't want this record.
The expected output therefore becomes:
##new_dfA
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Jennie F 300 700 3
##new_dfB
Name position string
Peter 89 aa
Jennie 568 bb
In reality, dfB is of size (443068765, 10) and dfA is of size (100000, 3), therefore I don't want to use numpy broadcasting because I run into a memory error. Is there a way to deal with this problem within the pandas framework? Insights will be appreciated.
If you have that many rows, pandas might not be well suited for your application.
That said, if there aren't many rows with identical "Name", you could merge on "Name" and then filter the rows matching your condition:
dfC = dfA.merge(dfB, on='Name')
dfC = dfC[dfC['position'].between(dfC['start_coordinate'], dfC['end_coordinate'])]
dfA_new = dfC[dfA.columns]
dfB_new = dfC[dfB.columns]
output:
>>> dfA_new
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
>>> dfB_new
Name position string
0 Peter 89 aa
1 Jennie 568 bb
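Given how big dfB is, you can keep the same merge-and-filter logic but run it over slices of dfB, so only one slice is merged at a time. A sketch (untested; the chunk size is an arbitrary knob to tune to your memory budget):
import pandas as pd

chunk_size = 1_000_000  # arbitrary; tune to your memory budget
parts = []
for start in range(0, len(dfB), chunk_size):
    # merge and filter one slice of dfB at a time
    part = dfB.iloc[start:start + chunk_size].merge(dfA, on='Name')
    part = part[part['position'].between(part['start_coordinate'],
                                         part['end_coordinate'])]
    parts.append(part)
dfC = pd.concat(parts, ignore_index=True)
# a dfA row can match several dfB rows, hence the drop_duplicates
dfA_new = dfC[dfA.columns].drop_duplicates()
dfB_new = dfC[dfB.columns].drop_duplicates()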
Use pandasql:
from pandasql import sqldf

sqldf("select dfA.* from dfA inner join dfB on dfB.Name=dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
pd.sql("select df2.* from df1 inner join df2 on df2.name=df1.name and df2.position between df1.start_coordinate and df1.end_coordinate",globals())
Name position string
0 Peter 89 aa
1 Jennie 568 bb

Find the total score for each student

Hi, I have a query that retrieves student scores in this format:
Id subject class_score total_marks rank subject_type
1 MATLAB 33.80 73.30 11 Core
1 SCIENCE 39.50 83.00 4 Core
1 ENGLISH 37.60 77.60 8 Elective
1 WQP 43.90 77.40 9 Core
1 BDT 42.00 71.50 12 Elective
1 ART 47.10 84.60 2 Elective
1 COMPUTING 26.00 65.50 13 Elective
2 MATLAB 0.00 0.00 14 Core
2 SCIENCE 38.60 73.60 10 Core
2 ENGLISH 43.80 83.30 3 Elective
2 WQP 45.00 89.00 1 Core
2 BDT 41.00 79.50 6 Elective
2 ART 38.00 78.00 7 Elective
2 COMPUTING 40.80 80.80 5 Elective
I wish to calculate the overall score of each student by adding up the total_marks of all Core subjects plus the top 3 scores of the student's Elective subjects, and then rank the students from first to last. Please can anyone assist? Thanks.
My current attempt:
use ods_sms;
SELECT student_id, SUM(t.total_marks) AS score
FROM (
    SELECT student_id, total_marks
    FROM tab_exam_tracker
    WHERE subject_type = 'Core'
    UNION ALL
    SELECT t.student_id, t.total_marks
    FROM (
        SELECT student_id, total_marks
        FROM tab_exam_tracker
        WHERE subject_type = 'Elective'
        ORDER BY total_marks DESC
        LIMIT 3
    ) t
) t
GROUP BY student_id
But I am getting the wrong results.
The problem is that your LIMIT 3 subquery runs once over all students, so it picks the top three electives overall rather than per student. You need a per-student rank; before MySQL 8 you can simulate one with user variables:
SELECT
    t.id,
    SUM(t.total_marks) AS Core_Marks,
    e_marks.Elective_Marks,
    SUM(t.total_marks) + e_marks.Elective_Marks AS Overall_Score
FROM tab_exam_tracker t
INNER JOIN (
    SELECT id, SUM(total_marks) AS Elective_Marks
    FROM (
        SELECT
            id,
            total_marks,
            -- restart the running rank whenever we reach a new student
            IF(@id <> id, @rnk := 1, @rnk := @rnk + 1) AS rnk,
            @id := id AS prev_id
        FROM tab_exam_tracker,
             (SELECT @id := 0 AS r_id, @rnk := 0 AS r) AS vars
        WHERE subject_type = 'Elective'
        ORDER BY id, total_marks DESC
    ) AS b
    WHERE rnk <= 3
    GROUP BY id
) AS e_marks ON t.id = e_marks.id
WHERE t.subject_type = 'Core'
GROUP BY t.id, e_marks.Elective_Marks
ORDER BY Overall_Score DESC;

Complex pandas aggregation

I have a table as below:
User_ID Cricket Football Chess Video_ID Category Time
1 200 150 100 111 A Morning
1 200 150 100 222 B Morning
1 200 150 100 111 A Afternoon
1 200 150 100 333 A Morning
2 100 160 80 444 C Evening
2 100 160 80 222 C Evening
2 100 160 80 333 A Morning
2 100 160 80 333 A Morning
The table above is a transactional table; each entry represents a user watching a video.
For example, User_ID 1 has watched videos 4 times.
The videos watched are given in Video_ID: 111, 222, 111, 333.
NOTE:
Video_ID 111 was watched twice by this user.
Cricket, Football, Chess: these values are duplicated on every row for a given user, i.e. the number of times User_ID 1 played cricket, football and chess is 200, 150 and 100 (repeated on that user's other rows).
Category: which category that particular Video_ID belongs to.
Time: what time the Video_ID was watched.
I am trying to get the below information from the table :
User_ID Top_1_Game Top_2_Game Top_1_Cat Top_2_Cat Top_Time
1 Cricket Football A B Morning
2 Football Cricket C A Evening
NOTE: If the counts of two categories are the same, either one can be kept as Top_1_Cat.
It's a bit complex; can anyone help with this?
First get the top value per group of User_ID and Video_ID with Series.value_counts and index[0]:
import numpy as np
import pandas as pd

df1 = df.groupby(['User_ID','Video_ID']).agg(lambda x: x.value_counts().index[0])
Then get the second top Category with GroupBy.nth:
s = df1.groupby(level=0)['Category'].nth(1)
Remove duplicates by User_ID with DataFrame.drop_duplicates:
df1 = df1.reset_index().drop_duplicates('User_ID').drop('Video_ID', axis=1)
cols = ['User_ID','Category','Time']
cols1 = df1.columns.difference(cols)
Get the top 2 games like this:
df2 = pd.DataFrame((cols1[np.argsort(-df1[cols1].values, axis=1)[:,:2]]),
columns=['Top_1_Game','Top_2_Game'],
index=df1['User_ID'])
Filter Category and Time and rename the columns:
df3 = (df1[cols].set_index('User_ID')
.rename(columns={'Category':'Top_1_Cat','Time':'Top_Time'}))
Join everything together with DataFrame.join and insert the Top_2_Cat values with DataFrame.insert:
df = df2.join(df3).reset_index()
df.insert(4, 'Top_2_Cat', s.values)
print (df)
User_ID Top_1_Game Top_2_Game Top_1_Cat Top_2_Cat Top_Time
0 1 Cricket Football A B Morning
1 2 Football Cricket C A Evening
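If you only need the two game columns, a possibly simpler route for that part (a sketch, assuming the same df as above) is to melt the per-user game counts into long format and rank them:
# one row per user with the three play counts, then long format
games = (df[['User_ID', 'Cricket', 'Football', 'Chess']]
         .drop_duplicates('User_ID')
         .melt(id_vars='User_ID', var_name='Game', value_name='Count')
         .sort_values(['User_ID', 'Count'], ascending=[True, False]))
# top two games per user, e.g. 1 -> ['Cricket', 'Football'], 2 -> ['Football', 'Cricket']
top2 = games.groupby('User_ID')['Game'].apply(lambda s: s.head(2).tolist())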

column values in a row

I have the following table:
id count hour age range
-------------------------------------
0 5 10 61 10-200
1 6 20 61 10-200
2 7 15 61 10-200
5 9 5 61 201-300
7 10 25 61 201-300
0 5 10 62 10-20
1 6 20 62 10-20
2 7 15 62 10-20
5 9 5 62 21-30
1 8 6 62 21-30
7 10 25 62 21-30
10 15 30 62 31-40
I need to select the distinct values of the column range.
I tried the following query:
Select distinct range as interval from table_name where age = 62;
Its result is a single column, as follows:
interval
----------
10-20
21-30
31-40
How can I get the result as follows?
10-20, 21-30, 31-40
EDITED:
I am now trying the following query:
select sys_connect_by_path(range,',') interval
from
(select distinct NVL(range,'0') range , ROW_NUMBER() OVER (ORDER BY RANGE) rn
from table_name where age = 62)
where connect_by_isleaf = 1 CONNECT BY rn = PRIOR rn+1 start with rn = 1;
which gives me this output:
Interval
----------------------------------------------------------------------------
, 10-20,10-20,10-20,21-30,21-30, 31-40
Please help me get my desired output.
If you are on 11.2 rather than just 11.1, you can use the LISTAGG aggregate function
SELECT listagg( interval, ',' )
WITHIN GROUP( ORDER BY interval )
FROM (SELECT DISTINCT range AS interval
FROM table_name
WHERE age = 62)
If you are using an earlier version of Oracle, you could use one of the other Oracle string aggregation techniques on Tim Hall's page. Prior to 11.2, my personal preference would be to create a user-defined aggregate function so that you can then
SELECT string_agg( interval )
FROM (SELECT DISTINCT range AS interval
FROM table_name
WHERE age = 62)
If you don't want to create a function, however, you can use the ROW_NUMBER and SYS_CONNECT_BY_PATH approach, though that tends to get a bit harder to follow:
with x as (
SELECT DISTINCT range AS interval
FROM table_name
WHERE age = 62 )
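-- number the distinct intervals, chain row n to row n-1 with CONNECT BY,
-- and keep the longest concatenated path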
select ltrim( max( sys_connect_by_path(interval, ','))
keep (dense_rank last order by curr),
',') range
from (select interval,
row_number() over (order by interval) as curr,
row_number() over (order by interval) -1 as prev
from x)
connect by prev = PRIOR curr
start with curr = 1
