virtual file set column and rowset variable U-SQL - azure

I'm having an issue with scheduling job in Data Factory.
I'm trying to approach a scheduled job per hour which will execute the same script each hour with different condition.
Consider I have a bunch of Avro Files spread in Azure Data Lake Store with following pattern.
/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}{date:MM}{date:dd}__{date:H}
Each hour new files are added to Data Lake Store.
In order to process the files only once I decided to handle them by help of U-SQL virtual file set column and some SyncTable which i created in Data Lake Store.
My query looks like following.
DECLARE #file_set_path string = /Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H};
#result = EXTRACT [Id] long,
....
date DateTime
FROM #file_set_path
USING someextractor;
#rdate =
SELECT MAX(ProcessedDate) AS ProcessedDate
FROM dbo.SyncTable
WHERE EntityName== "SomeEntity";
#finalResult = SELECT [Id],... FROM #result
CROSS JOIN #rdate AS r
WHERE date >= r.ProcessedDate;
since I can't use rowset variable in where clause I'm cross joining the singe row with set , however even in this case U-SQL won't find the correct files and always return all files set.
Is there any workaround or other approach ?

I think this approach should work unless there is something not quite right somewhere, ie can you confirm the datatypes of the dbo.SyncTable table? Dump out #rdate and make sure the value you get there is what you expect.
I put together a simple demo which worked as expected. My copy of SyncTable had one record with the value of 01/01/2018:
#working =
SELECT *
FROM (
VALUES
( (int)1, DateTime.Parse("2017/12/31") ),
( (int)2, DateTime.Parse("2018/01/01") ),
( (int)3, DateTime.Parse("2018/02/01") )
) AS x ( id, someDate );
#rdate =
SELECT MAX(ProcessedDate) AS maxDate
FROM dbo.SyncTable;
//#output =
// SELECT *
// FROM #rdate;
#output =
SELECT *, (w.someDate - r.maxDate).ToString() AS diff
FROM #working AS w
CROSS JOIN
#rdate AS r
WHERE w.someDate >= r.maxDate;
OUTPUT #output TO "/output/output.csv"
USING Outputters.Csv();
I did try this with a filepath (full script here). The thing to remember is the custom date format H represents the hour as a number from 0 to 23. If your SyncTable date does not have a time component to it when you insert it, it will default to midnight (0), meaning the whole day will be collected. Your file structure should look something like this according to your pattern:
"D:\Data Lake\USQLDataRoot\Data\SomeEntity\2017\12\31\SomeEntity_2017_12_31__8\test.csv"
I note your filepath has underscores in the second section and a double underscore before the hour section (which will be between 0 and 23, single digit up to the hour 10). I notice your fileset path does not have a file type or quotes - I've used test.csv in my tests. My results:
Basically I think the approach will work, but there is something not quite right, maybe in your file structure, the value in your SyncTable, the datatype etc. You need to go over the details, dump out intermediate values to check until you find the problem.

Doesn't the gist of wBob's full script resolve your issue? Here is a very slightly edited version of wBob's full script to address some of the issues you raised:
Ability to filter on SyncTable,
last part of pattern is file name and not folder. Sample file and structure: \Data\SomeEntity\2018\01\01\SomeEntity_2018_01_01__1
DECLARE #file_set_path string = #"/Data/SomeEntity/{date:yyyy}/{date:MM}/{date:dd}/SomeEntity_{date:yyyy}_{date:MM}_{date:dd}__{date:H}";
#input =
EXTRACT [Id] long,
date DateTime
FROM #file_set_path
USING Extractors.Text();
// in lieu of creating actual table
#syncTable =
SELECT * FROM
( VALUES
( "SomeEntity", new DateTime(2018,01,01,01,00,00) ),
( "AnotherEntity", new DateTime(2018,01,01,01,00,00) ),
( "SomeEntity", new DateTime(2018,01,01,00,00,00) ),
( "AnotherEntity", new DateTime(2018,01,01,00,00,00) ),
( "SomeEntity", new DateTime(2017,12,31,23,00,00) ),
( "AnotherEntity", new DateTime(2017,12,31,23,00,00) )
) AS x ( EntityName, ProcessedDate );
#rdate =
SELECT MAX(ProcessedDate) AS maxDate
FROM #syncTable
WHERE EntityName== "SomeEntity";
#output =
SELECT *,
date.ToString() AS dateString
FROM #input AS i
CROSS JOIN
#rdate AS r
WHERE i.date >= r.maxDate;
OUTPUT #output
TO "/output/output.txt"
ORDER BY Id
USING Outputters.Text(quoting:false);

Also please note that file sets cannot perform partition elimination on dynamic joins, since the values are not known to the optimizer during the preparation phase.
I would suggest to pass the Sync point as a parameter from ADF to the processing script. Then the value is known to the optimizer and file set partition elimination will kick in. In the worst case, you would have to read the value from your sync table in a previous script and use it as a parameter in the next.

Related

How to make my identity column consecutive on delta table in Azure Databricks?

I am trying to create a delta table with a consecutive identity column. The goal is for our clients to see if there is some data they did not receive from us.
It looks like the generated identity column is not consecutive. Which makes the "INCREMENT BY 1" quite misleading.
store_visitor_type_name = ["apple","peach","banana","mango","ananas"]
card_type_name = ["door","desk","light","coach","sink"]
store_visitor_type_desc = ["monday","tuesday","wednesday","thursday","friday"]
colnames = ["column2","column3","column4"]
data_frame = spark.createDataFrame(zip(store_visitor_type_name,card_type_name,store_visitor_type_desc),colnames)
data_frame.createOrReplaceTempView('vw_increment')
data_frame.display()
%sql
CREATE or REPLACE TABLE TEST(
`column1SK` BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1)
,`column2` STRING
,`column3` STRING
,`column4` STRING
,`inserted_timestamp` TIMESTAMP
,`modified_timestamp` TIMESTAMP
)
USING delta
LOCATION '/mnt/Marketing/Sales';
MERGE INTO TEST as target
USING vw_increment as source
ON target.`column2` = source.`column2`
WHEN MATCHED
AND (target.`column3` <> source.`column3`
OR target.`column4` <> source.`column4`)
THEN
UPDATE SET
`column2` = source.`column2`
,`modified_timestamp` = current_timestamp()
WHEN NOT MATCHED THEN
INSERT (
`column2`
,`column3`
,`column4`
,`modified_timestamp`
,`inserted_timestamp`
) VALUES (
source.`column2`
,source.`column3`
,source.`column4`
,current_timestamp()
,current_timestamp()
)
I'm getting the following results. You can see this is not sequential.What is also very confusing is that it is not starting at 1, while explicitely mentionned in the query.
I can see in the documentation (https://docs.databricks.com/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html#parameters) :
The automatically assigned values start with start and increment by
step. Assigned values are unique but are not guaranteed to be
contiguous. Both parameters are optional, and the default value is 1.
step cannot be 0.
Is there a workaround to make this identity column consecutive ?
I guess I could have another column and do a ROW_NUMBER operation after the MERGE, but it looks expensive.
You can utilize Pyspark to achieve the requirement instead of using row_number() function.
I have read the TEST table as a spark dataframe and converted it to pandas on spark dataframe. In pandas dataframe, using reset_index(), I have created a new index column.
Then I have converted it back to spark dataframe. I have added 1 to the index column values since the index starts with 0.
df = spark.sql("select * from test")
pdf = df.to_pandas_on_spark()
#to create new index column.
pdf.reset_index(inplace=True)
final_df = pdf.to_spark()
#Since index starts from 0, I have added 1 to it.
final_df.withColumn('index',final_df['index']+1).show()

cross apply an array of values recorded every 10 mins from a timestamp and generate their timestamps in stream analytics

I have the following stream analytics input:
{ "ID":"DEV-001-Test",
"TMSMUTC":"2021-10-14T14:00:00.000",
"MSGTYP":"TELEMETRY",
"THING":[
{
"TMSDUTC":"2021-10-14T13:00:00.000",
"DATA":[
{
"TAGID":"TAGB",
"VALUE":30
},
{
"TAGID":"TAGX",
"VALUE":[30.34,245.65,30.34,245.65,245.65,30.34]
}
]
}
]
}
in which the array of values for the "TAGX" is representing a value recorded from a sensor every 10 mins for one hour from the timestamp "TMSDUTC":"2021-10-14T13:00:00.000".
I was wondering how could make a query that would give me a similar output:
output
my main doubts are how to create the sequence of 10 mins from the timestamp and cross apply the values to it.
That's a good one! Note that I highly recommend you use VSCode and the ASA extension when working on these queries. The developer experience is much nicer than in the portal thanks to local testing, and you can also unit test your query via the npm package.
I took the following assumptions:
THING is an array of a single record. Let me know if that's not the case
[edited] TMSDUTC needs to be incremented by 10 minutes according to the position of each item in the array when applicable (TAGX)
With that, here is the query. It's split in multiple code blocks to explain the flow, but I also pasted it whole in the last code block.
First we bring all the required fields to the first level. It makes things easier to read, but not only. GetArrayElements needs an array to CROSS APPLY, but GetArrayElement (singular) doesn't return the type at compile time. Using an intermediary query step solves that.
WITH things AS (
SELECT
ID,
GetArrayElement(THING,0).TMSDUTC AS TMSDUTC,
MSGTYP AS MessageType,
GetArrayElement(THING,0).DATA AS DATA
FROM [input]
),
Then we expand DATA:
dataAll AS (
SELECT
T.ID,
T.TMSDUTC,
T.MessageType,
D.ArrayValue.TAGID AS Tag,
D.ArrayValue.Value AS [Value]
FROM things T
CROSS APPLY GetArrayElements(T.DATA) AS D
),
Then we create a subset for records that have a VALUE of type array (TAGX in your example). Here I avoid hard-coding per tag by detecting the type at runtime. These records will need another round of array processing in the following step.
dataArrays AS (
SELECT
A.ID,
A.TMSDUTC,
A.MessageType,
A.Tag,
A.[Value]
FROM dataAll A
WHERE GetType(A.[Value]) = 'array'
),
Now we can focus on expanding VALUE for those records. Note that we could not do that in a single pass (filter on arrays above and CROSS APPLY below), as GetArrayElements checks types before filtering is done.
[edited] To increment TMSDUTC, we use DATEADD on the index of each item in its array (ArrayIndex/ArrayValue are both returned from the array expansion, see doc below).
dataArraysExpanded AS (
SELECT
A.ID,
DATEADD(minute,10*V.ArrayIndex,A.TMSDUTC) AS TMSDUTC,
A.MessageType,
A.Tag,
V.ArrayValue AS [Value]
FROM dataArrays A
CROSS APPLY GetArrayElements(A.[Value]) AS V
),
We union back everything together:
newSchema AS (
SELECT ID, TMSDUTC, MessageType, Tag, [Value] FROM dataAll WHERE GetType([Value]) != 'array'
UNION
SELECT ID, TMSDUTC, MessageType, Tag, [Value] FROM dataArraysExpanded
)
And finally insert everything into the destination:
SELECT
*
INTO myOutput
FROM newSchema
[edited] Please note that the only order guaranteed on a result set is the one defined by the timestamp. If multiple records occur on the same timestamp, no order is guaranteed by default. Here, at the end of the query, all of the newly created events are still timestamped on the timestamp of the original event. If you now need to apply time logic on the newly generated TMSDUTC, you will need to output these records to Event Hub, and load them in another job using TIMESTAMP BY TMSDUTC. Currently the timestamp can only be changed directly at the very first step of a query.
What is used here :
GetArrayElement (singular) : doc
WITH aka Common Table Expression (CTE) : doc
CROSS APPLY + GetArrayElements : doc and doc, plus very good ref
GetType : doc
The entire thing for easier copy/pasting:
WITH things AS (
SELECT
ID,
GetArrayElement(THING,0).TMSDUTC AS TMSDUTC,
MSGTYP AS MessageType,
GetArrayElement(THING,0).DATA AS DATA
FROM [input]
),
dataAll AS (
SELECT
T.ID,
T.TMSDUTC,
T.MessageType,
D.ArrayValue.TAGID AS Tag,
D.ArrayValue.Value AS [Value]
FROM things T
CROSS APPLY GetArrayElements(T.DATA) AS D
),
dataArrays AS (
SELECT
A.ID,
A.TMSDUTC,
A.MessageType,
A.Tag,
A.[Value]
FROM dataAll A
WHERE GetType(A.[Value]) = 'array'
),
dataArraysExpanded AS (
SELECT
A.ID,
DATEADD(minute,10*V.ArrayIndex,A.TMSDUTC) AS TMSDUTC,
A.MessageType,
A.Tag,
V.ArrayValue AS [Value]
FROM dataArrays A
CROSS APPLY GetArrayElements(A.[Value]) AS V
),
newSchema AS (
SELECT ID, TMSDUTC, MessageType, Tag, [Value] FROM dataAll WHERE GetType([Value]) != 'array'
UNION
SELECT ID, TMSDUTC, MessageType, Tag, [Value] FROM dataArraysExpanded
)
SELECT
*
INTO myOutput
FROM newSchema

Flatten two arrays having corresponding values using mapping data flow in azure data factory

I am new to data flows in adf. I have a set of json files in a folder which i would like to parse and flatten two arrays into a csv.
The json structure is as follows:
I would like to flatten the timestamps and values arrays. Each timestamp in Timestamps array has the corresponding value in the Values array. i.e, value at Timestamps[i] is Values[i]. The output should have a row for each timestamp and corresponding value for a particular label.
Can anyone please help me in achieving this
Thank you!
To do the correlation between 2 arrays use mapIndex function and merge the arrays together. Then you will use a flatten to get it into relational form.
mapIndex(Timestamps, #(
Timestamp = #item, Value = Values[#index]))
In your data flow, add 2 Flatten transformations and point to the Timestamps and Values arrays in the unroll property in each transformation.
I know this is rather too late for an answer but I came across this when I was struggling to get the example working. Taking what Kiran wrote and expanding on it, I've written it up as a full end-to-end blog post which is here; it might save others time in future
https://medium.com/kinandcartacreated/consuming-informatica-api-data-with-data-factory-and-data-flows-a3220f6190d3
You could also do this with Azure Synapse Analytics, serverless SQL pools which supports OPENJSON:
SELECT
JSON_VALUE( m.value, '$.Id' ) Id,
JSON_VALUE( m.value, '$.Label' ) Label,
t.value AS ts,
v.value AS v
FROM
OPENROWSET(
BULK 'https://somestorage.dfs.core.windows.net/datalake/raw/json/metrics.json',
FORMAT = 'CSV',
FIELDQUOTE = '0x0b',
FIELDTERMINATOR ='0x0b',
ROWTERMINATOR = '0x0b'
)
WITH (
jsonContent varchar(MAX)
) AS [result]
CROSS APPLY OPENJSON( jsonContent, '$.MetricDataResults' ) m
CROSS APPLY OPENJSON( m.value, '$.Timestamps' ) t
CROSS APPLY OPENJSON( m.value, '$.Values' ) v
WHERE t.[key] = v.[key];
My results:
It's a valid alternative approach with a pay as you go service.

how to concatenate multiple row or column data into one row or column from a text file while importing data into db2 table

For Eg:
1)File has
ID|Name|job|hobby|salary|hobby2
2)Data:
1|ram|architect|tennis|20000|cricket
1|ram|architect|football|20000|gardening
2|krish|teacher|painting|25000|cooking
3)Table:
Columns in table: ID-Name-Job-Hobby-Salary
Is it possible to load data into table as below:
1-ram-architect-tenniscricketfootbalgardening-20000
2-krish-teacher-paintingcooking-25000
Command: db2 "Load CLIENT FROM ABC.FILE of DEL MODIFIED BY coldel0x7x keepblanks REPLACE INTO tablename(ID,Name,Job,Hobby,salary) nonrecoverable"
You cannot achieve what you think you want in a single action with either LOAD CLIENT or IMPORT.
You are asking to denormalize, and I presume you understand the consequences.
Regardless, you can use a multi-step approach, first load/import into a temporary table, and then in a second step use SQL to denormalize into the final table, before discarding the temporary table.
Or if you are adept with awk , and the data file is correctly sorted, then you can pre-process the file externally to a database before load/import.
Or use an ETL tool.
You may use the INGEST command instead of LOAD.
You must create the corresponding infrastructure for this command beforehand with the following command, for example:
CALL SYSINSTALLOBJECTS('INGEST', 'C', 'USERSPACE1', NULL);
Load your file afterwards with the following command:
INGEST FROM FILE ABC.FILE
FORMAT DELIMITED by '|'
(
$id INTEGER EXTERNAL
, $name CHAR(8)
, $job CHAR(20)
, $hobby CHAR(20)
, $salary INTEGER EXTERNAL
, $hobby2 CHAR(20)
)
MERGE INTO tablename
ON ID = $id
WHEN MATCHED THEN
UPDATE SET hobby = hobby CONCAT $hobby CONCAT $hobby2
WHEN NOT MATCHED THEN
INSERT (ID, NAME, JOB, HOBBY, SALARY) VALUES($id, $name, $job, $hobby CONCAT $hobby2, $salary);

Subsonic 3 Simple Query inner join sql syntax

I want to perform a simple join on two tables (BusinessUnit and UserBusinessUnit), so I can get a list of all BusinessUnits allocated to a given user.
The first attempt works, but there's no override of Select which allows me to restrict the columns returned (I get all columns from both tables):
var db = new KensDB();
SqlQuery query = db.Select
.From<BusinessUnit>()
.InnerJoin<UserBusinessUnit>( BusinessUnitTable.IdColumn, UserBusinessUnitTable.BusinessUnitIdColumn )
.Where( BusinessUnitTable.RecordStatusColumn ).IsEqualTo( 1 )
.And( UserBusinessUnitTable.UserIdColumn ).IsEqualTo( userId );
The second attept allows the column name restriction, but the generated sql contains pluralised table names (?)
SqlQuery query = new Select( new string[] { BusinessUnitTable.IdColumn, BusinessUnitTable.NameColumn } )
.From<BusinessUnit>()
.InnerJoin<UserBusinessUnit>( BusinessUnitTable.IdColumn, UserBusinessUnitTable.BusinessUnitIdColumn )
.Where( BusinessUnitTable.RecordStatusColumn ).IsEqualTo( 1 )
.And( UserBusinessUnitTable.UserIdColumn ).IsEqualTo( userId );
Produces...
SELECT [BusinessUnits].[Id], [BusinessUnits].[Name]
FROM [BusinessUnits]
INNER JOIN [UserBusinessUnits]
ON [BusinessUnits].[Id] = [UserBusinessUnits].[BusinessUnitId]
WHERE [BusinessUnits].[RecordStatus] = #0
AND [UserBusinessUnits].[UserId] = #1
So, two questions:
- How do I restrict the columns returned in method 1?
- Why does method 2 pluralise the column names in the generated SQL (and can I get round this?)
I'm using 3.0.0.3...
So far my experience with 3.0.0.3 suggests that this is not possible yet with the query tool, although it is with version 2.
I think the preferred method (so far) with version 3 is to use a linq query with something like:
var busUnits = from b in BusinessUnit.All()
join u in UserBusinessUnit.All() on b.Id equals u.BusinessUnitId
select b;
I ran into the pluralized table names myself, but it was because I'd only re-run one template after making schema changes.
Once I re-ran all the templates, the plural table names went away.
Try re-running all 4 templates and see if that solves it for you.

Resources