SQL Transpose rows to columns (group by key variable)? - pivot

I am trying to transpose rows into columns, grouping by a unique identifier (CASE_ID).
I have a table with this structure:
| CASE_ID | AMOUNT | TYPE |
|---------|--------|------|
| 100     | 10     | A    |
| 100     | 50     | B    |
| 100     | 75     | A    |
| 200     | 33     | B    |
| 200     | 10     | C    |
And I am trying to query it to produce this structure...
| CASE_ID | AMOUNT1 | TYPE1 | AMOUNT2 | TYPE2 | AMOUNT3 | TYPE3 |
|---------|---------|-------|---------|-------|---------|--------|
| 100 | 10 | A | 50 | B | 75 | A |
| 200 | 33 | B | 10 | C | (null) | (null) |
(assume much larger dataset with large number of possible values for CASE_ID, TYPE and AMOUNT)
I tried to use PIVOT, but I don't need an aggregate function (I'm simply trying to restructure the data). Now I'm trying to somehow use row_number(), but I'm not sure how.
I'm basically trying to replicate an SPSS command called CASESTOVARS, but I need to be able to do it in SQL. Thanks.

You can get the result by creating a sequential number with row_number() and then using an aggregate function with a CASE expression:
select case_id,
  max(case when seq = 1 then amount end) amount1,
  max(case when seq = 1 then type end) type1,
  max(case when seq = 2 then amount end) amount2,
  max(case when seq = 2 then type end) type2,
  max(case when seq = 3 then amount end) amount3,
  max(case when seq = 3 then type end) type3
from
(
  select case_id, amount, type,
    row_number() over(partition by case_id
                      order by case_id) seq
  from yourtable
) d
group by case_id;
See SQL Fiddle with Demo.
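Note that row_number() above partitions by case_id and also orders by case_id, so within each case_id the assignment of seq (and therefore which row becomes AMOUNT1/TYPE1) is arbitrary. If you need a deterministic layout, order by a real column instead; for example, assuming amount is a sensible ordering (the question does not specify one):

select case_id, amount, type,
  row_number() over(partition by case_id
                    order by amount) seq
from yourtable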
If you are using a database product that has the PIVOT function, then you can use row_number() with PIVOT, but I would suggest that you first unpivot the amount and type columns. The basic syntax for a limited number of values in SQL Server would be:
select case_id, amount1, type1, amount2, type2, amount3, type3
from
(
  select case_id, col + cast(seq as varchar(10)) as col, value
  from
  (
    select case_id, amount, type,
      row_number() over(partition by case_id
                        order by case_id) seq
    from yourtable
  ) d
  cross apply
  (
    select 'amount', cast(amount as varchar(20)) union all
    select 'type', type
  ) c (col, value)
) src
pivot
(
  max(value)
  for col in (amount1, type1, amount2, type2, amount3, type3)
) piv;
See SQL Fiddle with Demo.
If you have an unknown number of values, then you can use dynamic SQL to get the result - SQL Server syntax would be:
DECLARE @cols AS NVARCHAR(MAX),
        @query AS NVARCHAR(MAX)

select @cols = STUFF((SELECT ',' + QUOTENAME(col + cast(seq as varchar(10)))
                      from
                      (
                        select row_number() over(partition by case_id
                                                 order by case_id) seq
                        from yourtable
                      ) d
                      cross apply
                      (
                        select 'amount', 1 union all
                        select 'type', 2
                      ) c (col, so)
                      group by seq, col, so
                      order by seq, so
                      FOR XML PATH(''), TYPE
                      ).value('.', 'NVARCHAR(MAX)')
                ,1,1,'')

set @query = 'SELECT case_id,' + @cols + '
              from
              (
                select case_id, col + cast(seq as varchar(10)) as col, value
                from
                (
                  select case_id, amount, type,
                    row_number() over(partition by case_id
                                      order by case_id) seq
                  from yourtable
                ) d
                cross apply
                (
                  select ''amount'', cast(amount as varchar(20)) union all
                  select ''type'', type
                ) c (col, value)
              ) x
              pivot
              (
                max(value)
                for col in (' + @cols + ')
              ) p '

execute sp_executesql @query;
See SQL Fiddle with Demo. Each version will give the result:
| CASE_ID | AMOUNT1 | TYPE1 | AMOUNT2 | TYPE2 | AMOUNT3 | TYPE3 |
|---------|---------|-------|---------|-------|---------|--------|
| 100 | 10 | A | 50 | B | 75 | A |
| 200 | 33 | B | 10 | C | (null) | (null) |

Related

Spark SQL - getting row count for each window using spark SQL window functions

I want to use Spark SQL window functions to do some aggregations and windowing.
Suppose I'm using the example table provided here: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
I want to run a query that gives me the top 2 revenues for each category and also the count of products in each category.
After I run this query
SELECT
  product,
  category,
  revenue,
  count
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
    count(*) OVER (PARTITION BY category ORDER BY revenue DESC) as count
  FROM productRevenue) tmp
WHERE
  rank <= 2
I got the table like this:
| product | category | revenue | count |
|---------|----------|---------|-------|
| pro2    | tablet   | 6500    | 1     |
| mini    | tablet   | 5500    | 2     |
instead of
| product | category | revenue | count |
|---------|----------|---------|-------|
| pro2    | tablet   | 6500    | 5     |
| mini    | tablet   | 5500    | 5     |
which is what I expected.
How should I write my code to get the right count for each category (instead of using another separate Group By statement)?
In Spark, when a window specification includes an ORDER BY clause, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so count(*) becomes a running count.
For your case, add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to the count(*) window clause.
Try with:
SELECT
  product,
  category,
  revenue,
  count
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
    count(*) OVER (PARTITION BY category ORDER BY revenue DESC
                   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as count
  FROM productRevenue) tmp
WHERE
  rank <= 2
Change count(*) OVER (PARTITION BY category ORDER BY revenue DESC) as count to count(*) OVER (PARTITION BY category ORDER BY category DESC) as count. Because the ORDER BY column is the same as the PARTITION BY column, every row in the partition is a peer of the current row, so the default frame covers the whole partition and you get the expected count.
Try the code below.
scala> spark.sql("""SELECT
| product,
| category,
| revenue,
| rank,
| count
| FROM (
| SELECT
| product,
| category,
| revenue,
| dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
| count(*) OVER (PARTITION BY category ORDER BY category DESC) as count
| FROM productRevenue) tmp
| WHERE
| tmp.rank <= 2 """).show(false)
+----------+----------+-------+----+-----+
|product |category |revenue|rank|count|
+----------+----------+-------+----+-----+
|Pro2 |tablet |6500 |1 |5 |
|Mini |tablet |5500 |2 |5 |
|Thin |cell phone|6000 |1 |5 |
|Very thin |cell phone|6000 |1 |5 |
|Ultra thin|cell phone|5000 |2 |5 |
+----------+----------+-------+----+-----+

Spark SQL - best way to programmatically loop over a table

Say I have the following spark dataframe:
| Node_id | Parent_id |
|---------|-----------|
| 1 | NULL |
| 2 | 1 |
| 3 | 1 |
| 4 | NULL |
| 5 | 4 |
| 6 | NULL |
| 7 | 6 |
| 8 | 3 |
This dataframe represents a tree structure consisting of several disjoint trees. Now, say that we have a list of nodes [8, 7], and we want to get a dataframe containing just the nodes that are roots of the trees containing the nodes in the list. The output looks like:
| Node_id | Parent_id |
|---------|-----------|
| 1 | NULL |
| 6 | NULL |
What would be the best (fastest) way to do this with spark queries and pyspark?
If I were doing this in plain SQL I would just do something like this:
CREATE TABLE #Tmp (
  Node_id int,
  Parent_id int
)
INSERT INTO #Tmp SELECT Node_id, Parent_id FROM Child_Nodes
DECLARE @num int
SELECT @num = COUNT(*) FROM #Tmp WHERE Parent_id IS NOT NULL
WHILE @num > 0
BEGIN
  INSERT INTO #Tmp
  SELECT
    p.Node_id,
    p.Parent_id
  FROM
    #Tmp t
    LEFT JOIN Nodes p
      ON t.Parent_id = p.Node_id
  SELECT @num = COUNT(*) FROM #Tmp WHERE Parent_id IS NOT NULL
END
SELECT Node_id FROM #Tmp WHERE Parent_id IS NULL
Just wanted to know if there's a more spark-centric way of doing this using pyspark, beyond the obvious method of simply looping over the dataframe using python.
parent_nodes = spark.sql("select Parent_id from table_name where Node_id in (2, 7)").distinct()
You can then join the above dataframe back to the table to get the Parent_id of those parent nodes as well, repeating until you reach rows whose Parent_id is NULL (the roots); a rough sketch follows.
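As an illustration of that repeated join in Spark SQL, here is a minimal sketch; the table name table_name comes from the answer above, the starting list [8, 7] from the question, and the view name frontier is just for illustration. The loop that re-runs the climbing step until every Parent_id is NULL would live in ordinary driver code (e.g. a small Python while loop).

-- Starting set: the listed nodes and their parents.
CREATE OR REPLACE TEMPORARY VIEW frontier AS
SELECT Node_id, Parent_id
FROM table_name
WHERE Node_id IN (8, 7);

-- Climb one level: replace each node that still has a parent with that parent's row.
-- Re-running this step, feeding its result back in as the new frontier, eventually
-- leaves only rows with Parent_id IS NULL, i.e. the roots (1 and 6 for the sample data).
SELECT
  COALESCE(p.Node_id, f.Node_id) AS Node_id,
  p.Parent_id                    AS Parent_id
FROM frontier f
LEFT JOIN table_name p
  ON f.Parent_id = p.Node_id;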

lookup in presto using single column against a range in lookup table

I would like to perform a lookup in Presto where TableA contains my lookup column "lookup_code" and TableB has a range (start_, end_) with a lookup_range_description that I want to return.
TableA (lookup_code):
12
2333
50000
TableB (start_, end_, lookup_range_description):
2300, 4000, AverageCost1
23300, 239900, AverageCost2
193000, 193999, AverageCost3
Expected result (lookup_code, start_, end_, lookup_range_description):
12, ''
2333, 2300, 4000, AverageCost1
50000, 23300, 239900, AverageCost2
You may want to use LEFT OUTER JOIN with BETWEEN like this.
select
a.lookup_code
,b.start_
,b.end_
,b.lookup_range_description
from TableA a
left outer join TableB b
on a.lookup_code between b.start_ and b.end_
lookup_code | start_ | end_ | lookup_range_description
-------------+--------+--------+--------------------------
12 | NULL | NULL | NULL
2333 | 2300 | 4000 | AverageCost1
50000 | 23300 | 239900 | AverageCost2
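If you want the unmatched row (lookup_code 12) to come back as an empty string rather than NULL for the description, as in the expected result above, one small variation (not part of the answer itself) is to wrap the column in coalesce:

select
   a.lookup_code
  ,b.start_
  ,b.end_
  ,coalesce(b.lookup_range_description, '') as lookup_range_description
from TableA a
left outer join TableB b
  on a.lookup_code between b.start_ and b.end_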

Cassandra - select query with token() function

According to this documentation, I was trying a select query with the token() function in it, but it gives wrong results.
I am using the Cassandra version below:
[cqlsh 5.0.1 | Cassandra 2.2.5 | CQL spec 3.3.1 | Native protocol v4]
I was trying the token query on the table below:
CREATE TABLE price_key_test (
objectid int,
createdOn bigint,
price int,
foo text,
PRIMARY KEY ((objectid, createdOn), price));
Inserted data --
insert into nasa.price_key_test (objectid,createdOn,price,foo) values (1,1000,100,'x');
insert into nasa.price_key_test (objectid,createdOn,price,foo) values (1,2000,200,'x');
insert into nasa.price_key_test (objectid,createdOn,price,foo) values (1,3000,300,'x');
Data in table --
objectid | createdon | price | foo
----------+-----------+-------+-----
1 | 3000 | 300 | x
1 | 2000 | 200 | x
1 | 1000 | 100 | x
Select query is --
select * from nasa.price_key_test where token(objectid,createdOn) > token(1,1000) and token(objectid,createdOn) < token(1,3000)
This query is supposed to return the row with createdOn 2000, but it returns zero rows.
objectid | createdon | price | foo
----------+-----------+-------+-----
(0 rows)
According to my understanding, token(objectid,createdOn) > token(1,1000) and token(objectid,createdOn) < token(1,3000) should select the row with partition key values (1, 2000).
Is my understanding correct?
Try flipping your greater/less-than signs around:
aploetz@cqlsh:stackoverflow> SELECT * FROM price_key_test
WHERE token(objectid,createdOn) < token(1,1000)
AND token(objectid,createdOn) > token(1,3000) ;
objectid | createdon | price | foo
----------+-----------+-------+-----
1 | 2000 | 200 | x
(1 rows)
Adding the token() function to your SELECT should help you to understand why:
aploetz@cqlsh:stackoverflow> SELECT objectid, createdon, token(objectid,createdon),
price, foo FROM price_key_test ;
objectid | createdon | system.token(objectid, createdon) | price | foo
----------+-----------+-----------------------------------+-------+-----
1 | 3000 | -8449493444802114536 | 300 | x
1 | 2000 | -2885017981309686341 | 200 | x
1 | 1000 | -1219246892563628877 | 100 | x
(3 rows)
The hashed token values generated are not necessarily proportional to their original numeric values. In your case, token(1,3000) generated a hash that was the smallest of the three, and not the largest.
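If the goal is just to read specific partitions rather than scan a contiguous token range, you can avoid relying on token order entirely by querying the full partition key; a small sketch against the table above:

-- Fetch one partition by its complete partition key (objectid, createdOn);
-- no token() involved, so hash ordering does not matter.
SELECT * FROM nasa.price_key_test WHERE objectid = 1 AND createdOn = 2000;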

Include Summation Row with Group By Clause

Query:
SELECT aType, SUM(Earnings - Expenses) "Rev"
FROM aTable
GROUP BY aType
ORDER BY aType ASC
Results:
| aType | Rev |
| ----- | ----- |
| A | 20 |
| B | 150 |
| C | 250 |
Question:
Is it possible to display a summary row at the bottom such as below by using Sybase syntax within my initial query, or would it have to be a separate query altogether?
| aType | Rev |
| ----- | ----- |
| A | 20 |
| B | 150 |
| C | 250 |
=================
| All | 420 |
I couldn't get the ROLLUP function from SQL to translate over to Sybase successfully but I'm not sure if there is another way to do this, if at all.
Thanks!
Have you tried just using a UNION ALL similar to this:
select aType, Rev
from
(
SELECT aType, SUM(Earnings - Expenses) "Rev", 0 SortOrder
FROM aTable
GROUP BY aType
UNION ALL
SELECT 'All', SUM(Earnings - Expenses) "Rev", 1 SortOrder
FROM aTable
) src
ORDER BY SortOrder, aType
See SQL Fiddle with Demo. This gives the result:
| ATYPE | REV |
---------------
| A | 10 |
| B | 150 |
| C | 250 |
| All | 410 |
Maybe you can work it out with the COMPUTE clause in Sybase, like this:
create table #tmp1( name char(9), earning int , expense int)
insert into #tmp1 values("A",30,20)
insert into #tmp1 values("B",50,30)
insert into #tmp1 values("C",60,30)
select name, (earning-expense) resv from #tmp1
group by name
order by name,resv
compute sum(earning-expense)
OR
select name, convert(varchar(15),(earning-expense)) resv from #tmp1
group by name
union all
SELECT "------------------","-----"
union all
select "ALL",convert(varchar(15),sum(earning-expense)) from #tmp1
Thanks,
Gopal
Not all versions of Sybase support ROLLUP. You can do it the old-fashioned way:
with t as
     (SELECT aType, SUM(Earnings - Expenses) "Rev"
      FROM aTable
      GROUP BY aType
     )
select t.*
from ((select aType, rev from t) union all
      (select NULL, sum(rev) from t)
     ) t
ORDER BY (case when aType is NULL then 1 else 0 end), aType ASC
This is the yucky, brute force approach. If this version of Sybase doesn't support with, you can do:
select t.aType, t.Rev
from ((SELECT aType, SUM(Earnings - Expenses) "Rev"
       FROM aTable
       GROUP BY aType
      ) union all
      (select NULL, SUM(Earnings - Expenses)
       FROM aTable
      )
     ) t
ORDER BY (case when aType is NULL then 1 else 0 end), aType ASC
This is pretty basic, standard SQL.
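For comparison, on database products that do support ROLLUP (the question notes this particular Sybase version does not), the summary row can be produced directly; this is a sketch of that variant assuming SQL Server-style GROUP BY ROLLUP and GROUPING():

-- ROLLUP adds one extra row that aggregates over all aType values (aType comes back NULL),
-- which COALESCE relabels as 'All'; GROUPING() pushes that row to the bottom.
SELECT COALESCE(aType, 'All') AS aType,
       SUM(Earnings - Expenses) AS Rev
FROM aTable
GROUP BY ROLLUP(aType)
ORDER BY GROUPING(aType), aType;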
