Alternative to the OVER (PARTITION BY ...) window function, because it is not supported

The solution to this question might be simple, but I can't translate other posts about this topic into my own script.
I'm looking for a query to select the highest delivery time for each consignment number, since a consignment can have more than one delivery time, because it can have more than one parcel.
I came up with this query, and it works fine when I'm using SQL Server.
select
    DELIVERYTIME
from (
    select
        h_parcel.CONSIGNMENT,
        S_PARCEL.DELIVERYTIME,
        row_number() over (partition by h_parcel.CONSIGNMENT order by S_PARCEL.DELIVERYTIME desc) as rn
    from
        S_PARCEL
        inner join h_parcel on h_parcel.h_parcel = s_parcel.h_parcel
) as t
where
    t.rn = 1
This code is used to fill a column in an ETL process, which is built in Visual Studio. Visual Studio does not support the OVER (PARTITION BY ...) function, so this code has to be translated into code without the partition function. Can someone please help me :)?
Thanks.
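One common workaround that avoids window functions entirely is a plain GROUP BY with MAX(): since only the highest delivery time per consignment is wanted, no row numbering is needed. A minimal sketch using sqlite3 as a stand-in engine, with toy data and the table/column names taken from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE h_parcel (h_parcel INTEGER, CONSIGNMENT TEXT)")
cur.execute("CREATE TABLE s_parcel (h_parcel INTEGER, DELIVERYTIME TEXT)")
cur.executemany("INSERT INTO h_parcel VALUES (?, ?)",
                [(1, "C1"), (2, "C1"), (3, "C2")])
cur.executemany("INSERT INTO s_parcel VALUES (?, ?)",
                [(1, "2020-01-01"), (2, "2020-01-05"), (3, "2020-02-01")])

# Highest delivery time per consignment -- no window function required.
rows = cur.execute("""
    SELECT h.CONSIGNMENT, MAX(s.DELIVERYTIME) AS DELIVERYTIME
    FROM s_parcel AS s
    INNER JOIN h_parcel AS h ON h.h_parcel = s.h_parcel
    GROUP BY h.CONSIGNMENT
    ORDER BY h.CONSIGNMENT
""").fetchall()
print(rows)  # [('C1', '2020-01-05'), ('C2', '2020-02-01')]
```

If further columns from the top row are needed, the same aggregate can be joined back to the detail rows on (CONSIGNMENT, DELIVERYTIME).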

Related

Databricks SQL nondeterministic expressions using DELETE FROM

I am trying to execute the following SQL clause using Databricks SQL:
DELETE FROM prod_gbs_gpdi.bronze_data.sapex_ap_posted AS HISTORICAL_DATA
WHERE
HISTORICAL_DATA._JOB_SOURCE_FILE = (SELECT MAX(NEW_DATA._JOB_SOURCE_FILE) FROM temp_sapex_posted AS NEW_DATA)
The intention of the query is to delete a set of rows in a historical data table based on a value present in a column of new data table.
For reasons that I cannot understand it is raising an error like:
Error in SQL statement: AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate, Window, or Generate, but found:
(HISTORICAL_DATA._JOB_SOURCE_FILE IN (listquery()))
in operator DeleteCommandEdge
It seems it is not accepting a subquery inside the WHERE clause. That's odd to me, as the Databricks documentation (link) says it is acceptable.
I even tried other types of predicates, like:
(SELECT FIRST(NEW_DATA._JOB_SOURCE_FILE) FROM temp_sapex_posted AS NEW_DATA)
(SELECT DISTINCT NEW_DATA._JOB_SOURCE_FILE FROM temp_sapex_posted AS NEW_DATA)
IN (SELECT NEW_DATA._JOB_SOURCE_FILE FROM temp_sapex_posted AS NEW_DATA)
None of them makes the query execute successfully.
What's even odder to me is that I was able to accomplish a similar case with a slightly different query, as can be seen in this link.
I have created demo_table1 & demo_table2 for querying purposes and wrote the following query with the same intent. I haven't used double aliases and went with a straightforward subquery; it works fine for me.
delete from demo_table1 as t1 where t1.age = (select min(t2.age) from demo_table2 as t2);
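When the engine rejects a subquery inside DELETE, another common workaround is to evaluate the subquery on its own first and pass the resulting scalar into the DELETE as a plain value. A rough sketch of that two-step pattern using sqlite3 as a stand-in (the table and column names here are illustrative, not the originals):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE historical (job_source_file TEXT)")
cur.execute("CREATE TABLE new_data (job_source_file TEXT)")
cur.executemany("INSERT INTO historical VALUES (?)", [("f1",), ("f2",), ("f3",)])
cur.executemany("INSERT INTO new_data VALUES (?)", [("f2",), ("f3",)])

# Step 1: evaluate the subquery separately and pull out the scalar.
(max_file,) = cur.execute("SELECT MAX(job_source_file) FROM new_data").fetchone()

# Step 2: run the DELETE with the concrete value -- no subquery in the WHERE clause.
cur.execute("DELETE FROM historical WHERE job_source_file = ?", (max_file,))

remaining = [r[0] for r in cur.execute(
    "SELECT job_source_file FROM historical ORDER BY 1")]
print(remaining)  # ['f1', 'f2']
```

In a Databricks notebook the same idea would be: run the SELECT MAX(...) first, capture the value in the driver, then interpolate it into the DELETE statement.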

Cannot execute CTE query via Azure DataBricks

I'm trying to execute a CTE query via Databricks and getting a syntax error for the SQL query. Is there any other way to use a CTE from Databricks?
Thanks in advance.
pushdown_query = """(WITH t(x, y) AS (SELECT 1, 2)
SELECT * FROM t WHERE x = 1 AND y = 2) as Test """
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
Error:-
"com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword WITH."
Just remove the "as Test" alias:
WITH t(x, y) AS (SELECT 1, 2) SELECT * FROM t WHERE x = 1 AND y = 2
This will work.
This cannot work with spark.read.jdbc. When you use the dbtable or query parameter, the effect is to insert your SQL code as a subquery inside a larger SELECT statement.
The Spark docs for the dbtable parameter are poor, IMHO, but you can see where this is heading in the query doc.
As an example, spark will issue a query of the following form to the JDBC Source.
SELECT <columns> FROM (<user_specified_query>) spark_gen_alias
The thing we specify gets turned into a subquery.
Add to that the fact that a WITH clause cannot appear inside a subquery. I'm still looking for proof of that claim in the Microsoft docs.
The closest I can find is in the doc referred to above,
which says WITH must be followed by a single SELECT statement. https://github.com/uglide/azure-content/blob/master/articles/sql-data-warehouse/sql-data-warehouse-migrate-code.md
I think CTE functionality is limited in Azure SQL Data Warehouse, which is now known as Azure Synapse. You may be able to rewrite some of your queries to do what you need without using the standard CTE syntax.
These links should shed some light on the situation.
https://github.com/uglide/azure-content/blob/master/articles/sql-data-warehouse/sql-data-warehouse-migrate-code.md
https://learn.microsoft.com/en-us/azure/azure-sql/database/transact-sql-tsql-differences-sql-server
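Since Spark wraps the pushdown text as SELECT <columns> FROM (<user_specified_query>) spark_gen_alias, one fix is to rewrite the CTE as an inline derived table, which is legal inside a subquery. A small sketch using sqlite3 as a stand-in JDBC source, simulating the wrapping by hand:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The CTE "WITH t(x, y) AS (SELECT 1, 2) SELECT ..." rewritten as a derived table,
# which can safely live inside a subquery.
inline = "SELECT * FROM (SELECT 1 AS x, 2 AS y) AS t WHERE x = 1 AND y = 2"

# Simulate what Spark does with the pushdown query: wrap it as a subquery.
wrapped = f"SELECT * FROM ({inline}) spark_gen_alias"

result = con.execute(wrapped).fetchall()
print(result)  # [(1, 2)]
```

The derived-table form survives the wrapping because, unlike WITH, it has no clause that must appear at the start of the statement.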

azure data factory syntax for scheduled slice activity

I'm using Azure Data Factory (V2) to schedule a copy pipeline activity - the requirement is that every day the job should run and select everything from a table from the last 5 days. I have scheduled the copy and tried the following syntax in the source dataset:
select * from [dbo].[aTable] where [aDate] >= '#{formatDateTime(adddays(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm' ),-5)}'
But this doesn't work; I'm getting an error stating that adddays expects an int for its second parameter but is receiving a string.
Can anybody advise on the proper way to nest this?
Thanks
I can't test this right now, so I'll risk a possible answer just by looking at your query. I think it should be like this:
select * from [dbo].[aTable] where [aDate] >= '#{formatDateTime(adddays(pipeline().parameters.windowStart, -5), 'yyyy-MM-dd HH:mm')}'
Hope this helps!
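The fix is purely about argument order: adddays takes the date first and the integer offset second, and formatDateTime then wraps the shifted result. A Python analogue of the corrected expression, with a datetime value standing in for pipeline().parameters.windowStart:

```python
from datetime import datetime, timedelta

# Stand-in for pipeline().parameters.windowStart
window_start = datetime(2020, 3, 10, 8, 30)

# Wrong order (the original attempt): format first, then try to add days
# to what is by then a string -- the analogue of the type error ADF raised.

# Correct order: apply the -5 day offset first, then format the result.
cutoff = (window_start + timedelta(days=-5)).strftime("%Y-%m-%d %H:%M")
print(cutoff)  # 2020-03-05 08:30
```

The same shape holds in the ADF expression: adddays(<date>, -5) on the inside, formatDateTime(..., 'yyyy-MM-dd HH:mm') on the outside.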

Delete all in FOXPRO

This question may seem rudimentary but nothing that I have found online quite fits.
I am looking at an old FOXPRO script that we used to make a table. At the moment, I am attempting to translate this script into SQL. Of note is the following,
delete all for code='000000'
pack
If I understand this correctly, it deletes all rows/records where the code field has a value of 000000. Am I correct?
"Into SQL"
What do you mean by "SQL"?
Do you mean do the same thing using SQL in VFP for a VFP table? If so then:
use myTable exclusive
delete from myTable where code = '000000'
pack
But I doubt you are asking this, when you could do it by simply using the xBase code you wrote.
Do you mean how to do that in an SQL backend like MS SQL Server, PostgreSQL, MySQL...? If so then:
delete from myTable where code = '000000'
Note: In your code "all" is unnecessary but wouldn't do any harm.
Note 2: In the VFP code you wrote, the first line only "marks" the rows that have code value '000000' as "deleted".
The second line (pack) actually removes those rows from the table.
What version of foxpro?
delete from [table] where code = '000000'
pack
or
delete for code = '000000'
pack
both should work
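As the answers note, the xBase pair (DELETE ALL FOR ... then PACK) collapses to a single DELETE in a SQL backend, with no separate "marked deleted" state. A small sqlite3 sketch of the equivalent, with toy data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE myTable (code TEXT, name TEXT)")
cur.executemany("INSERT INTO myTable VALUES (?, ?)",
                [("000000", "a"), ("111111", "b"), ("000000", "c")])

# Equivalent of:  delete all for code='000000'  followed by  pack
cur.execute("DELETE FROM myTable WHERE code = '000000'")

remaining = cur.execute("SELECT name FROM myTable ORDER BY name").fetchall()
print(remaining)  # [('b',)]
```

In SQL the rows are gone as soon as the DELETE commits; there is no PACK step because there is no two-phase mark-then-remove model.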

USQL nested query performance

I have a U-SQL query that runs fine on its own against 400M records in a managed table.
But during development, I don't want to run it against all records all the time, so I pop a where clause in, run it for a tiny subsection of data, and it completes in around 2 minutes (5 AUs), writing out results to a tsv in my data lake.
Happy with that.
However, I now want to use it as the source for a second query and further processing.
So I create a view with the original USQL (minus the where clause).
Then, to test, a new script:
SELECT * FROM MyView WHERE <my original test filter>
Now I was expecting that to execute in around the same time as the original raw query. But instead it got to 4 minutes, only 10% through the plan, and I cancelled - something is not right.
No expert at reading job graphs, but...
The original script kicks off with two 'Extract Combine partition' vertices, both reading a couple of hundred MBs; my select on the saved view is reading over 100 GB!
So it is not taking the where clause into account at all at this stage.
Obviously this shows how little I yet understand about how DLA works behind the scenes !
Would someone please help me understand (a) what is going on and (b) a path forward to get the behavior I need ?
Currently having a play with stored procedures to store the 1st result in a table and then call the second query against that - but just seems overkill compared with 'traditional' SQL Server ?!?
All pointers & hints appreciated !
Many Thanks
Original Base Query:
CREATE VIEW IF NOT EXISTS Play.[M3_CycleStartPoints]
AS
//#BASE =
SELECT ROW_NUMBER() OVER (PARTITION BY A.[CTNNumber] ORDER BY A.[SeqNo]) AS [CTNCycleNo], A.[CTNNumber], A.[SeqNo], A.[BizstepDescription], A.[ContainerStatus], A.[FillStatus]
FROM
[Play].[RawData] AS A
LEFT OUTER JOIN
(
SELECT [CTNNumber],[SeqNo]+1 AS [SeqNo],[FillStatus],[ContainerStatus],[BizstepDescription]
FROM [Play].[RawData]
WHERE [FillStatus] == "EMPTY" AND [AssetUsage] == "CYLINDER"
) AS B
ON A.[CTNNumber] == B.[CTNNumber] AND A.[SeqNo] == B.[SeqNo]
WHERE (
(A.[FillStatus] == "FULL" AND
A.[AssetUsage] == "CYLINDER" AND
B.[CTNNumber] == A.[CTNNumber]
) OR (
A.[SeqNo] == 1
)
);
//AND A.[CTNNumber] == "BE52XH7";
//Only used to test when running script as stand-alone & output to tsv
Second Query
SELECT *
FROM [Play].[M3_CycleStartPoints]
WHERE [CTNNumber] == "BE52XH7";
Ok, I think I've got this, or at least in part.
Table-valued functions (http://www.sqlservercentral.com/articles/U-SQL/146839/) allow passing an argument into what would otherwise be a view and returning the filtered result.
Would be interested in finding some reading material around this subject still though.
Coming from a T-SQL world, seems that there are some fundamental differences I'm still tripping over.
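The table-valued-function idea can be mimicked in any host language: instead of a fixed view whose filter may not be pushed down, a function takes the filter value and builds it into the inner query, so only matching rows are ever read. A rough sqlite3 sketch of the pattern (the names echo the question but this is not the U-SQL syntax):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE RawData (CTNNumber TEXT, SeqNo INTEGER)")
cur.executemany("INSERT INTO RawData VALUES (?, ?)",
                [("BE52XH7", 1), ("BE52XH7", 2), ("OTHER", 1)])

def cycle_start_points(ctn_number):
    """TVF-style access: the parameter is baked into the inner query,
    so the scan is filtered before any downstream processing."""
    return cur.execute(
        "SELECT CTNNumber, SeqNo FROM RawData WHERE CTNNumber = ? ORDER BY SeqNo",
        (ctn_number,),
    ).fetchall()

result = cycle_start_points("BE52XH7")
print(result)  # [('BE52XH7', 1), ('BE52XH7', 2)]
```

This is exactly what the U-SQL TVF buys over the view: the predicate travels into the definition instead of being applied to the full 100 GB result afterwards.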
