U-SQL nested query performance

I have a U-SQL query that runs fine on its own against 400M records in a managed table.
But during development, I don't want to run it against all records all the time, so I pop a where clause in, run it for a tiny subsection of data, and it completes in around 2 minutes (#5 AUs), writing out results to a tsv in my data lake.
Happy with that.
However, I now want to use it as the source for a second query and further processing.
So I create a view with the original USQL (minus the where clause).
Then to test, a new script :
'Select * from MyView WHERE <my original test filter>'.
Now I was expecting that to execute in around the same time as the original raw query. But instead I got to 4 minutes, only 10% through the plan, and cancelled - something is not right.
No expert at reading Job Graphs, but ...
The original script kicks off with 2x 'Extract Combine partition', both reading a couple of hundred MB; my select on the saved view is reading over 100 GB!
So it is not taking the where clause into account at all at this stage.
Obviously this shows how little I yet understand about how DLA works behind the scenes !
Would someone please help me understand (a) what is going on and (b) a path forward to get the behavior I need ?
Currently having a play with stored procedures to store the 1st result in a table and then call the second query against that - but just seems overkill compared with 'traditional' SQL Server ?!?
All pointers & hints appreciated !
Many Thanks
Original Base Query:
CREATE VIEW IF NOT EXISTS Play.[M3_CycleStartPoints]
AS
//#BASE =
SELECT ROW_NUMBER() OVER (PARTITION BY A.[CTNNumber] ORDER BY A.[SeqNo]) AS [CTNCycleNo],
       A.[CTNNumber], A.[SeqNo], A.[BizstepDescription], A.[ContainerStatus], A.[FillStatus]
FROM [Play].[RawData] AS A
LEFT OUTER JOIN
(
    SELECT [CTNNumber], [SeqNo] + 1 AS [SeqNo], [FillStatus], [ContainerStatus], [BizstepDescription]
    FROM [Play].[RawData]
    WHERE [FillStatus] == "EMPTY" AND [AssetUsage] == "CYLINDER"
) AS B
ON A.[CTNNumber] == B.[CTNNumber] AND A.[SeqNo] == B.[SeqNo]
WHERE
(
    (
        A.[FillStatus] == "FULL" AND
        A.[AssetUsage] == "CYLINDER" AND
        B.[CTNNumber] == A.[CTNNumber]
    )
    OR
    (
        A.[SeqNo] == 1
    )
);
//AND A.[CTNNumber] == "BE52XH7";
//Only used to test when running script as stand-alone & output to tsv
Second Query:
SELECT *
FROM [Play].[M3_CycleStartPoints]
WHERE [CTNNumber] == "BE52XH7";

Ok, I think I've got this, or at least in part.
Table-valued functions:
http://www.sqlservercentral.com/articles/U-SQL/146839/
These let you pass an argument into what is effectively a parameterised view and get the result back.
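A minimal sketch of what that might look like for the view above (the function name and the column types in the RETURNS clause are my assumptions, not taken from the real schema):
CREATE FUNCTION IF NOT EXISTS Play.[M3_CycleStartPointsTVF](@ctnNumber string)
RETURNS @result TABLE
(
    [CTNCycleNo] long,            // ROW_NUMBER() yields a long
    [CTNNumber] string,           // remaining types are assumed
    [SeqNo] int,
    [BizstepDescription] string,
    [ContainerStatus] string,
    [FillStatus] string
)
AS
BEGIN
    @result =
        SELECT *
        FROM Play.[M3_CycleStartPoints]
        WHERE [CTNNumber] == @ctnNumber;
END;
// Usage in the second script: SELECT * FROM Play.[M3_CycleStartPointsTVF]("BE52XH7");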
Would be interested in finding some reading material around this subject still though.
Coming from a T-SQL world, it seems there are some fundamental differences I'm still tripping over.

Related

Databricks SQL nondeterministic expressions using DELETE FROM

I am trying to execute the following SQL clause using Databricks SQL:
DELETE FROM prod_gbs_gpdi.bronze_data.sapex_ap_posted AS HISTORICAL_DATA
WHERE
HISTORICAL_DATA._JOB_SOURCE_FILE = (SELECT MAX(NEW_DATA._JOB_SOURCE_FILE) FROM temp_sapex_posted AS NEW_DATA)
The intention of the query is to delete a set of rows in a historical data table based on a value present in a column of new data table.
For reasons that I cannot understand it is raising an error like:
Error in SQL statement: AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate, Window, or Generate, but found:
(HISTORICAL_DATA._JOB_SOURCE_FILE IN (listquery()))
in operator DeleteCommandEdge
It seems it is not accepting a subquery inside the WHERE clause. That's odd to me, as the Databricks documentation (link) says it is acceptable.
I even tried other types of predicates, like:
(SELECT FIRST(NEW_DATA._JOB_SOURCE_FILE) FROM temp_sapex_posted AS NEW_DATA)
(SELECT DISTINCT NEW_DATA._JOB_SOURCE_FILE FROM temp_sapex_posted AS NEW_DATA)
IN (SELECT NEW_DATA._JOB_SOURCE_FILE FROM temp_sapex_posted AS NEW_DATA)
None of them executes the query successfully either.
What's even odder is that I was able to accomplish a similar case with a slightly different query, as can be seen in this link.
I have created demo_table1 & demo_table2 for testing and wrote the following query for a similar purpose. I haven't used double aliases and have gone with a straightforward subquery; it also depends on the data frame in use (I used a normal pandas data frame). It works fine for me.
delete from demo_table1 as t1 where t1.age = (select min(t2.age) from demo_table2 as t2);
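If the DELETE itself keeps rejecting the scalar subquery, one possible workaround (just a sketch, assuming the target is a Delta table) is to express the delete as a MERGE, so the subquery moves from the WHERE clause of the DELETE into the USING clause:
MERGE INTO prod_gbs_gpdi.bronze_data.sapex_ap_posted AS HISTORICAL_DATA
USING (
    -- latest source file present in the new data
    SELECT MAX(_JOB_SOURCE_FILE) AS MAX_FILE
    FROM temp_sapex_posted
) AS NEW_DATA
ON HISTORICAL_DATA._JOB_SOURCE_FILE = NEW_DATA.MAX_FILE
WHEN MATCHED THEN DELETE;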

System Resource Exceeded when alter table counter (autonumber)

I made a simple program in Excel that connects to Access. The user of this program is not good with coding/Access, so it should be done with only a few clicks.
The workflow:
1. Import the raw data from BW (in Excel) into Access (without an ID column; about 28k rows).
2. Add the ID column as an AutoNumber.
I use this code for the second part:
acObj.CurrentDb.Execute "ALTER TABLE " & ptableName & " ADD COLUMN ID COUNTER (1, 1);", dbFailOnError
and I get error 3035: System resource exceeded.
With only 16k rows it works just fine.
Any solution?
You could remove dbFailOnError, and use dbInconsistent instead. That way the query is executed non-transactionally.
The odds of this query failing are pretty much none (it fails if there already is an ID column, but you can check for that). Executing it non-transactionally means Access doesn't have to cache the result, and it uses less system resources.

Spark: Use Temporary Table Twice in Query?

I am working on converting a set of Hive queries to run on Spark. So far I have gotten a nice performance boost by creating TEMP tables where Hive was previously creating new tables on disk. I have run into a query where the TEMP table is called twice in the same query, and this causes a failure. I have tried writing my temp table to disk, but I notice that the "saveAsTable" function is deprecated, and when I try to use it my program fails due to executor timeouts. I would prefer not to have to write to disk anyway. I have considered rewriting the Hive query, but would prefer to leave it alone. Do I have any other options?
Sample Query
SELECT d.LEVEL_1,
d.LEVEL_2,
d.CODE
FROM
( SELECT DISTINCT CP.LEVEL_1_ID,
LEVEL_2_ID,
FROM ETL_ESTIMATED_PROVIDER_DATA_1_1 CP
LEFT JOIN ETL_ESTIMATED_PROVIDER_DATA_1_1 F ON D.demo_id = F.demo_id
AND D.LEVEL_1_ID = F.LEVEL_1_ID
AND D.LEVEL_2_ID = F.LEVEL_2_ID
I have reduced the query a bit to try and show the basic concept, but may have broken it during reduction.
Your query has multiple parts. Did you try running them separately -
first:
SELECT D.LEVEL_1_ID,
       D.LEVEL_2_ID
FROM ETL_ESTIMATED_PROVIDER_DATA_1_1 D
LEFT JOIN ETL_ESTIMATED_PROVIDER_DATA_1_1 F ON D.demo_id = F.demo_id
    AND D.LEVEL_1_ID = F.LEVEL_1_ID
    AND D.LEVEL_2_ID = F.LEVEL_2_ID
second:
SELECT DISTINCT D.LEVEL_1_ID,
       D.LEVEL_2_ID
FROM ETL_ESTIMATED_PROVIDER_DATA_1_1 D
LEFT JOIN ETL_ESTIMATED_PROVIDER_DATA_1_1 F ON D.demo_id = F.demo_id
    AND D.LEVEL_1_ID = F.LEVEL_1_ID
    AND D.LEVEL_2_ID = F.LEVEL_2_ID
Also, the second one is your answer; you don't need another SELECT on top of it. You are just missing d.CODE in the second SELECT.
I have run similar self-joins in Spark and they work.
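To reference the temp table twice in one query without writing it to disk, one option (a sketch, assuming the data is registered as a temporary view under that name) is to cache it first in Spark SQL, so the self-join does not recompute it:
-- Cache the temporary view so both references in the self-join reuse it
CACHE TABLE ETL_ESTIMATED_PROVIDER_DATA_1_1;

SELECT DISTINCT D.LEVEL_1_ID,
       D.LEVEL_2_ID
FROM ETL_ESTIMATED_PROVIDER_DATA_1_1 D
LEFT JOIN ETL_ESTIMATED_PROVIDER_DATA_1_1 F ON D.demo_id = F.demo_id
    AND D.LEVEL_1_ID = F.LEVEL_1_ID
    AND D.LEVEL_2_ID = F.LEVEL_2_ID;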

Alternative to over(partition...) function, because it is not supported

The solution to this question might be simple, but I can't translate other posts about this topic into my own script.
I'm looking for a query to select the highest delivery time for each consignment number, since a consignment can have more than one delivery time, because it can contain more than one parcel.
I came up with this query, and it works fine when I'm using SQL Server.
select
DELIVERYTIME
from (
select
h_parcel.CONSIGNMENT, S_PARCEL.DELIVERYTIME,
(row_number() over(partition by h_parcel.CONSIGNMENT order by S_PARCEL.DELIVERYTIME desc)) as rn
from
S_PARCEL
inner join
h_parcel on h_parcel.h_parcel = s_parcel.h_parcel) as t
where
t.rn = 1
This code is used to fill a column in an ETL process, which is done in Visual Studio. Visual Studio does not support the over(partition by ...) function, so this query has to be translated into one without the partition function. Can someone please help me :)?
Thanks.
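One common rewrite (just a sketch, assuming only the delivery time itself is needed per consignment) is to replace the ROW_NUMBER() ranking with a plain GROUP BY and MAX, which needs no window function:
-- Highest delivery time per consignment, without over(partition by ...)
select
    h_parcel.CONSIGNMENT,
    max(S_PARCEL.DELIVERYTIME) as DELIVERYTIME
from
    S_PARCEL
    inner join h_parcel on h_parcel.h_parcel = s_parcel.h_parcel
group by
    h_parcel.CONSIGNMENT;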

Can I import SAP tables that were exported by SE16?

I have exported the contents of a table with transaction SE16, by selecting all the entries and choosing Download, unconverted.
I'd like to import these entries into another system (where the same table exists and is active).
Furthermore, when I import, there's a possibility that the specific key already exists for a number of entries (old entries).
Other entries won't have a field with the same key present in the table where they're to be imported (new entries).
Is there a way to easily update my table in the second system with the file provided from the first system? If needed, I can export the data in the 3 other format types (Spreadsheet, Rich text format and HTML format). It seems to me though like the spreadsheet and rich text formats sometimes corrupt the data, and the html is far too verbose.
[EDIT]
As per popular demand, the table I'm trying to export/import is a Z table whose fields are all numeric, character, date or time fields (flat data types).
I'm trying to do it like this because the clients don't have any Basis resource to help them transport, and they would like to kinda automate the process of updating one of the tables in one system.
At the moment it's a business request to do it like this, but I'm open to suggestions (and the clients are open too).
Edit
Ok, I doubt that what you describe in your comment exists out of the box, but you can easily write something like it:
Create a method (or function module, if that floats your boat) that accepts the following:
iv_table_name TYPE string and
iv_filename TYPE string
This would be the method:
method upload_table.
  data: lt_table type ref to data,
        lx_root  type ref to cx_root.
  field-symbols: <table> type standard table.

  try.
      " Create an internal table typed dynamically after the requested DB table
      create data lt_table type table of (iv_table_name).
      assign lt_table->* to <table>.

      " Upload the (tab-delimited) file from the frontend into the internal table
      call method cl_gui_frontend_services=>gui_upload
        exporting
          filename            = iv_filename
          has_field_separator = abap_true
        changing
          data_tab            = <table>
        exceptions
          others              = 4.
      if sy-subrc <> 0.
        "Some appropriate error handling
        "message id sy-msgid type 'I'
        "        number sy-msgno
        "        with sy-msgv1 sy-msgv2
        "             sy-msgv3 sy-msgv4.
        return.
      endif.

      " Insert or update the database table with the uploaded rows
      modify (iv_table_name) from table <table>.
      "write: / sy-tabix, ' entries updated'.

    catch cx_root into lx_root.
      "lv_text = lx_root->get_text( ).
      "some appropriate error handling
      return.
  endtry.
endmethod.
This would still require that you make sure that the exported file matches the table that you want to import. However cl_gui_frontend_services=>gui_upload should return sy-subrc > 0 in that case, so you can bail out before you corrupt any data.
Original Answer:
I'll assume that you want to update a z-table and not a SAP standard table.
You will probably have to format your datafile a little bit to make it tab or comma delimited.
You can then upload the data file using cl_gui_frontend_services=>gui_upload
Then, if you want to overwrite the existing data in the table, you can use:
modify zmydbtab from table it_importeddata.
If you do not want to overwrite existing entries, you can use:
insert zmydbtab from table it_importeddata.
You will get a return code of sy-subrc = 4 if any of the keys already exists, but any new entries will be inserted.
Note
There are many reasons why you would NOT do this for an SAP standard table. Most prominent is that there is almost always more to the data model than what we are aware of. Also, when creating transactional data, there are often follow-on events or workflows that kick off, which will not be the case if you update the database directly. As a rule of thumb, it is usually a bad idea to update SAP standard tables directly.
In that case try to find a BADI, or if that's not available, record a BDC and do the updates that way.
If the system landscape was setup correctly, your client would not need any kind of basis operations support whatsoever to perform the transports. So instead of re-inventing the wheel, I'd strongly suggest to catch up on what the CTS and TMS can do once they're setup with sensible settings.
