We are trying to convert the SQL query below to a Spark SQL query, but it fails when we add multiple PIVOT clauses.
select * from (
    select columnName, columnName1, code, CodeDisplay,
        concat('code_', row_number() over (partition by claimidentifierValueDetails order by claimidentifierValueDetails, code)) as codeSequence,
        concat('codeDisplay_', row_number() over (partition by claimidentifierValueDetails order by claimidentifierValueDetails, code)) as codeSequence1
    from Tablename
) temp
pivot (max(code) FOR codeSequence IN (code_1, code_2, code_3)) pivot1
pivot (max(CodeDisplay) FOR codeSequence1 IN (codeDisplay_1, codeDisplay_2, codeDisplay_3)) pivot2
When I try to run the same query in Spark SQL, it fails when I add two PIVOT clauses but works with one:
select * from (
    select columnName, columnName1, code, CodeDisplay,
        concat('code_', row_number() over (partition by claimidentifierValueDetails order by claimidentifierValueDetails, code)) as codeSequence,
        concat('codeDisplay_', row_number() over (partition by claimidentifierValueDetails order by claimidentifierValueDetails, code)) as codeSequence1
    from Tablename
) temp
pivot (max(code) FOR codeSequence IN ('code_1', 'code_2', 'code_3')) pivot1
pivot (max(CodeDisplay) FOR codeSequence1 IN ('codeDisplay_1', 'codeDisplay_2', 'codeDisplay_3')) pivot2
Could you please tell me how to convert it?
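In case it helps, here is a minimal PySpark sketch of one possible workaround: Spark SQL appears to accept only one PIVOT per FROM clause (which matches the failure described above), so this pivots each sequence column separately in the DataFrame API and joins the results. It assumes Tablename is accessible to the session and that columnName and columnName1 are the grouping keys that identify an output row; the grouping keys are implicit in the SQL PIVOT, so these are assumptions.
# Sketch only: pivot twice in the DataFrame API and join, instead of using two PIVOT clauses.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

w = (Window.partitionBy("claimidentifierValueDetails")
           .orderBy("claimidentifierValueDetails", "code"))

temp = (spark.table("Tablename")  # assumes Tablename is a table or temp view
        .withColumn("codeSequence", F.concat(F.lit("code_"), F.row_number().over(w).cast("string")))
        .withColumn("codeSequence1", F.concat(F.lit("codeDisplay_"), F.row_number().over(w).cast("string"))))

keys = ["columnName", "columnName1"]  # assumed grouping keys

codes = (temp.groupBy(*keys)
             .pivot("codeSequence", ["code_1", "code_2", "code_3"])
             .agg(F.max("code")))

displays = (temp.groupBy(*keys)
                .pivot("codeSequence1", ["codeDisplay_1", "codeDisplay_2", "codeDisplay_3"])
                .agg(F.max("CodeDisplay")))

result = codes.join(displays, on=keys, how="inner")
result.show()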
I have seen a very strange issue with a SQL query on Azure SQL Server.
I have a query that uses a subquery in an IN clause.
This one returns the error "Error converting data type nvarchar to numeric":
select * from tbl1 where id in (select id from table2 inner join table3 on table2.a=table3.a where status=0 )
This one works. The subquery only returns 500 rows, so the TOP 1000000 should not change anything:
select * from tbl1 where id in (select top 1000000 id from table2 inner join table3 on table2.a=table3.a where status=0 )
Thanks
My question is about getting a row number as SN in a SQL query against an Access database, in which I get total sales for the day grouped by [Bill Date].
My working query is:
sql = "SELECT [Bill Date] as [Date], Sum(Purchase) + Sum(Returns) as [Total Sales] FROM TableName Group By [Bill Date];"
I found the ROW_NUMBER clause on the Internet and tried it like this:
sql = "SELECT ROW_NUMBER() OVER (ORDER BY [Bill Date]) AS [SN], [Bill Date] as [Date], Sum(Purchase) + Sum(Returns) as [Total Sales] FROM TableName Group By [Bill Date];"
When I run the above code I get this error:
-2147217900 Syntax error (missing operator) in query expression ROW_NUMBER() OVER (ORDER BY [Bill Date]);"
I am using Excel VBA to connect to the Access database.
Could anyone help me get it in the correct order?
It looks like you didn't define DESC or ASC in (ORDER BY [Bill Date]); it needs to be something like (ORDER BY [Bill Date] DESC) or (ORDER BY [Bill Date] ASC).
I am working in PySpark and need to write a query that reads data from a Hive table and returns a PySpark DataFrame containing all the columns plus a row number.
This is what I tried:
SELECT *, ROW_NUMBER() OVER () as rcd_num FROM schema_name.table_name
This query works fine in Hive, but when I run it from a PySpark script it throws the following error:
Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table;
Please suggest a solution.
Note: I do not want the rows in any particular order; I just need row numbers for all the rows in the table, without any sorting or ordering.
I am using Spark 2.1.
ROW_NUMBER() requires an ordering, so you can use the monotonically_increasing_id function instead, which gives you an id for every row in the table:
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("rcd_num", monotonically_increasing_id())
OR
SELECT *, ROW_NUMBER() OVER (Order by (select NULL)) as rcd_num FROM schema_name.table_name
That is, you can order the window by SELECT NULL.
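As a self-contained sketch of the first option (assuming the Hive table is schema_name.table_name and the session has Hive support enabled); note that monotonically_increasing_id() produces ids that are unique and increasing, but not necessarily consecutive:
# Sketch only: add a row id without imposing any ordering on the data.
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.table("schema_name.table_name")  # all columns from the Hive table
df_with_ids = df.withColumn("rcd_num", monotonically_increasing_id())
df_with_ids.show()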
I am new to the Spark environment. I have a dataset with the following column names:
user_id, Date_time, order_quantity
I want to calculate the 90th percentile of order_quantity for each user_id.
If it were SQL, I would have used the following query:
%sql
SELECT user_id, PERCENTILE_CONT ( 0.9 ) WITHIN GROUP (ORDER BY order_quantity) OVER (PARTITION BY user_id)
However, Spark doesn't have built-in support for the percentile_cont function.
Any suggestions on how I can implement this in Spark on the above dataset?
Please let me know if more information is needed.
I have a solution for PERCENTILE_DISC(0.9), which returns the discrete order_quantity closest to the 0.9 percentile (without interpolation).
The idea is to calculate PERCENT_RANK, subtract 0.9, take the absolute value, and then pick the value with the minimal distance (my_table below is a placeholder for your source table):
%sql
WITH temp1 AS (
  SELECT
    user_id,
    order_quantity,
    ABS(PERCENT_RANK() OVER
      (PARTITION BY user_id ORDER BY order_quantity) - 0.9) AS perc_90_temp
  FROM my_table
)
SELECT DISTINCT
  user_id,
  FIRST_VALUE(order_quantity) OVER
    (PARTITION BY user_id ORDER BY perc_90_temp) AS perc_disc_90
FROM
  temp1;
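If you prefer the DataFrame API, a rough PySpark rendering of the same PERCENT_RANK trick could look like this (assuming a DataFrame df with user_id and order_quantity; perc_90_temp and perc_disc_90 are just illustrative names):
# Sketch only: distance of each row's percent_rank from 0.9, then take the closest order_quantity per user.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_rank = Window.partitionBy("user_id").orderBy("order_quantity")
temp1 = df.withColumn("perc_90_temp", F.abs(F.percent_rank().over(w_rank) - 0.9))

w_pick = Window.partitionBy("user_id").orderBy("perc_90_temp")
result = (temp1
          .withColumn("perc_disc_90", F.first("order_quantity").over(w_pick))
          .select("user_id", "perc_disc_90")
          .distinct())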
I was dealing with a similar issue too. I worked in SAP HANA and then I moved to Spark SQL on Databricks. I have migrated the following SAP HANA query:
SELECT
DISTINCT ITEM_ID,
LOCATION_ID,
PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY VENTAS) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY PRECIO) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM MY_TABLE
to
SELECT DISTINCT
ITEM_ID,
LOCATION_ID,
PERCENTILE(VENTAS,0.8) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
PERCENTILE(PRECIO,0.5) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM
delta.`MY_TABLE`
In your particular case it should be as follows:
SELECT DISTINCT user_id, PERCENTILE(order_quantity,0.9) OVER (PARTITION BY user_id)
I hope this helps.
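For completeness, a minimal PySpark sketch of the same idea through the DataFrame API (assuming a DataFrame df with user_id and order_quantity; percentile is reachable through expr, and percentile_approx is a faster approximate alternative):
# Sketch only: one exact 0.9 percentile of order_quantity per user_id.
from pyspark.sql import functions as F

per_user_p90 = (df.groupBy("user_id")
                  .agg(F.expr("percentile(order_quantity, 0.9)").alias("perc_90")))
per_user_p90.show()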
We can easily read records from a Hive table in Spark with this command:
Row[] results = sqlContext.sql("FROM my_table SELECT col1, col2").collect();
But when I join two tables, such as:
select t1.col1, t1.col2 from table1 t1 join table2 t2 on t1.id = t2.id
How do I retrieve the records from the above join query?
The SQLContext.sql method always returns a DataFrame, so there is no practical difference between a JOIN and any other type of query.
You shouldn't use the collect method, though, unless fetching the data to the driver is really the desired outcome. It is expensive and will crash if the data cannot fit in the driver's memory.
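As a minimal PySpark sketch (assuming table1 and table2 are visible to the session as Hive tables or temp views), the join runs like any other query and the resulting DataFrame can be inspected without pulling everything to the driver:
# Sketch only: run the join through the SQL entry point and look at a bounded sample.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

joined = spark.sql("""
    SELECT t1.col1, t1.col2
    FROM table1 t1
    JOIN table2 t2 ON t1.id = t2.id
""")

joined.show(20)            # print a few rows on the driver
sample = joined.take(10)   # fetch only a small, bounded list of Rows if needed locally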