JOIN in Azure Stream Analytics

I have a requirement to validate the values of one column against master data in Stream Analytics.
I have written queries to fetch some data from a blob location, and one of the column values should be validated against master data available in another blob location.
Below is the SAQL I tried. signals1 is the master data in blob storage and signals2 is the data being processed and to be validated:
WITH MASTER AS (
SELECT [signals1].VAL as VAL
FROM [signals1]
)
SELECT
ID,
VAL,
SIG
INTO [output]
FROM signals2
I have to validate the VAL from signals2 against VAL in signals1.
If the VAL in signals2 is present in signals1, then we should write to output.
If the VAL in signals2 is not present in signals1, then that document should be ignored (not written to output).
I tried with a JOIN and a WHERE clause, but it's not working as expected.
Any leads on how to achieve this using JOIN or WHERE?

If your Signal1 data is the reference input and Signal2 is the streaming input, you can use something like the following query:
with signals as (select * from Signal2 I join Signal1 R ON I.Val = R.Val)
select * into output from signals
I tested this query locally, and I assumed that your reference data (Signal1) is in the format:
[
{
"Val":"123",
"Data":"temp"
},
{
"Val":"321",
"Data":"humidity"
}
]
And, for example, your Signal2 (the streaming input) is:
{
"Val":"123",
"SIG":"k8s23kk",
"ID":"1234589"
}
Have a look at this query and these data samples to see if they can guide you towards the solution.
Side note: you cannot use this kind of join if Signal1 is also streaming data. Stream-to-stream joins require time windowing; without it, the join is not possible.
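For illustration, this is a minimal sketch of a time-bounded stream-to-stream join in Stream Analytics SQL, assuming (hypothetically) that both signals1 and signals2 were streaming inputs; the 10-minute bound is arbitrary:
SELECT
    S2.ID,
    S2.VAL,
    S2.SIG
INTO [output]
FROM [signals1] S1
JOIN [signals2] S2
    -- the DATEDIFF in the ON clause is what bounds the join in time
    ON S1.VAL = S2.VAL
    AND DATEDIFF(minute, S1, S2) BETWEEN 0 AND 10
Without the DATEDIFF condition, Stream Analytics rejects a join between two streams.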

Related

Delta Live Tables and ingesting AVRO

So, I'm trying to load Avro files into DLT, create pipelines, and so forth.
As a simple data frame in Databricks, I can read and unpack the Avro files using read.json / rdd.map / lambda functions, create a temp view, run a SQL query, and then select the fields I want.
# example command
in_path = '/mnt/file_location/*/*/*/*/*.avro'
avroDf = spark.read.format("com.databricks.spark.avro").load(in_path)
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)
data.createOrReplaceTempView("eventhub")
# selecting the data
sql_query1 = sqlContext.sql("""
select distinct
data.field.test1 as col1
,data.field.test2 as col2
,data.field.fieldgrp.city as city
from
eventhub
""")
However, I am now trying to replicate the process using Delta Live Tables and pipelines.
I have used Auto Loader to load the files into a table and kept the format as is, so bronze is just Avro in its rawest form.
I then planned to create a view that exposes the unpacked Avro, much like I did above with "eventhub", which would then allow me to create queries.
The trouble is, I can't get it to work in DLT. I fail at the second step, after I have imported the files into the bronze layer. It just does not seem to apply the functions that make the data readable/selectable.
This is the sort of code I have been trying. However, it does not seem to pick up the schema, so it is as if the functions are not working; when I try to select a column, it does not recognise it.
# unpacked data
@dlt.view(name=f"eventdata_v")
def eventdata_v():
avroDf = spark.read.format("delta").table("live.bronze_file_list")
jsonRdd = avroDf.select(avroDf.Body.cast("string")).rdd.map(lambda x: x[0])
data = spark.read.json(jsonRdd)
return data
# trying to query the data, but it does not recognise field names, even when I select "data" only
@dlt.view(name=f"eventdata2_v")
def eventdata2_v():
df = (
dlt.read("eventdata_v")
.select("data.field.test1 ")
)
return df
I have been working on this for weeks, trying different approaches, but still no luck.
Any help will be much appreciated. Thank you.

How do you set up a Synapse Serverless SQL External Table over partitioned data?

I have set up a Synapse workspace and imported the Covid-19 sample data into a PySpark notebook.
blob_account_name = "pandemicdatalake"
blob_container_name = "public"
blob_relative_path = "curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
blob_sas_token)
df = spark.read.parquet(wasbs_path)
I have then partitioned the data by country_region, and written it back down into my storage account.
df.write.partitionBy("country_region") \
    .mode("overwrite") \
    .parquet("abfss://rawdata@synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/")
All that works fine as you can see. So far I have only found a way to query data from the exact partition using OPENROWSET, like this...
SELECT
TOP 100 *
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=Afghanistan/**',
FORMAT = 'PARQUET'
) AS [result]
I want to set up a Serverless SQL external table over the partitioned data, so that when people run a query with "WHERE country_region = x" it only reads the appropriate partition. Is this possible, and if so, how?
You need to get the partition value using the filepath function, as shown below, and then filter on it. That achieves partition elimination. You can confirm it by comparing the bytes read with and without the filter on that column.
CREATE VIEW MyView
As
SELECT
*, filepath(1) as country_region
FROM
OPENROWSET(
BULK 'https://synapsepoc.dfs.core.windows.net/synapsepoc/Covid19/country_region=*/*',
FORMAT = 'PARQUET'
) AS [result]
GO
Select * from MyView where country_region='Afghanistan'
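As a rough way to confirm the elimination (using the MyView defined above), you can run a filtered and an unfiltered query and compare the data processed that Synapse reports for each; this is just an illustration, not required setup:
-- Should read only the country_region=Afghanistan folder
SELECT COUNT(*) FROM MyView WHERE country_region = 'Afghanistan';
-- Reads every partition folder, so the reported data processed should be much larger
SELECT COUNT(*) FROM MyView;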

Azure Stream Analytics Reference Input Join

I am trying to join a stream input (deviceinput) and a reference input (refinputpdjson) with the following query.
My inputs for the Stream Analytics job are:
Input 1: Stream input from IoT Hub
Input 2: Reference data from Azure Blob Storage
SELECT
din.EventProcessedUtcTime,
din.deviceid as streamdeviceid,
din.heartrate as streamheartrate,
refin.deviceid as refdeviceid,
refin.patientid as refpatientid
FROM
deviceinput din
TIMESTAMP BY EventProcessedUtcTime
LEFT OUTER JOIN
refinputpdjson refin
ON din.deviceid = refin.deviceid
but it's failing with the following error:
The join predicate is not time bounded. JOIN operation between data streams requires specifying max time distances between matching events. Please add DATEDIFF to the JOIN condition. Example: SELECT input1.a, input2.b FROM input1 JOIN input2 ON DATEDIFF(minute, input1, input2) BETWEEN 0 AND 10
As the error shows,
JOIN operation between data streams requires specifying max time distances between matching events.
So you can try something like this:
SELECT
din.EventProcessedUtcTime, din.deviceid as streamdeviceid, din.heartrate as streamheartrate, refin.deviceid as refdeviceid, refin.patientid as refpatientid
FROM deviceinput din TIMESTAMP BY EventProcessedUtcTime
LEFT OUTER JOIN refinputpdjson refin TIMESTAMP BY EventEndUtcTime
ON din.deviceid = refin.deviceid
AND DATEDIFF(minute,din,refin) BETWEEN 0 AND 15
For more detail, you can refer to this documentation.
It was an Azure service bug; it has been resolved by the MS team and is working fine now.
Thank you,
Kamlesh Khollam

How to implement timeWindow() in Apache Flink's StreamTableEnvironment?

Hi everyone,
I want to use a Flink time window in a StreamTableEnvironment.
I have previously used the timeWindow(Time.seconds()) function with a DataStream that comes from a Kafka topic.
For external reasons I am converting this DataStream to a Table and applying a SQL query with sqlQuery().
I want to do x-second time window aggregations in SQL and then send the result to another Kafka topic.
Data source:
val stream = senv
.addSource(new FlinkKafkaConsumer[String]("flink", new SimpleStringSchema(), properties))
example of previous aggregation:
val windowCounts = stream.keyBy("x").timeWindow(Time.seconds(5), Time.seconds(5))
Current Table:
val tableA = tableEnv.fromDataStream(parsed, 'user, 'product, 'amount)
In this part there should be a query that performs an aggregation every X seconds:
val result = tableEnv.sqlQuery(
s"SELECT * FROM $tableA WHERE amount > 2".stripMargin)
More or less, my aggregation will be count(y) OVER (PARTITION BY x).
Thank you!
Ververica's training for Flink SQL will help you with this. It includes some exercises/examples that cover just this kind of query in the section on Querying Dynamic Tables with SQL.
You'll have to establish the source of timing information for each event, which can be either processing time or event time, after which the query corresponding to stream.keyBy("x").timeWindow(Time.seconds(5), Time.seconds(5)) will be something like this:
SELECT
  x,
  TUMBLE_END(`timestamp`, INTERVAL '5' SECOND) AS t,
  COUNT(*) AS cnt
FROM Events
GROUP BY
  x, TUMBLE(`timestamp`, INTERVAL '5' SECOND);
For details on how to work with time attributes, see the Introduction to Time Attributes.
And for more detailed documentation on windowing with Flink SQL see the docs on Group Windows.
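As a sketch of the other piece, establishing the time attribute: if the source table is declared with SQL DDL rather than via fromDataStream, a WATERMARK clause marks the event-time column that TUMBLE operates on. This assumes a newer Flink version (1.11+) with the SQL Kafka connector, and the table, topic, column names, and connector options here are illustrative, not taken from the question:
-- Hypothetical DDL: `ts` becomes the event-time attribute referenced by TUMBLE(...)
CREATE TABLE Events (
    x       STRING,
    amount  INT,
    ts      TIMESTAMP(3),
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'flink',
    'properties.bootstrap.servers' = 'localhost:9092',
    'format' = 'json'
);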

Azure Stream Analytics - joining on a CSV file returns 0 rows

I have the following query:
SELECT
[VanList].deviceId
,[VanList].[VanName]
events.[timestamp]
,events.externaltemp
,events.internaltemp
,events.humidity
,events.latitude
,events.longitude
INTO
[iot-powerBI]
FROM
[iot-EventHub] as events timestamp by [timestamp]
join [VanList] on events.DeviceId = [VanList].deviceId
where iot-EventHub is my Event Hub and VanList is a reference list (a CSV file) that has been uploaded to Azure Storage.
I have tried uploading sample data to test the query, but it always returns 0 rows.
Below is a sample of the JSON captured by my Event Hub Input
[
{
"DeviceId":1,
"Timestamp":"2015-06-29T12:15:18.0000000",
"ExternalTemp":9,
"InternalTemp":8,
"Humidity":43,
"Latitude":51.3854942,
"Longitude":-1.12774682,
"EventProcessedUtcTime":"2015-06-29T12:25:46.0932317Z",
"PartitionId":1,
"EventEnqueuedUtcTime":"2015-06-29T12:15:18.5990000Z"
} ]
Below is a sample of my CSV reference data.
deviceId,VanName
1,VAN 1
2,VAN 2
3,Standby Van
Both lists contain a device id of 1, so I am expecting my query to be able to join the two together.
I have tried using both "inner join" and "join" in my query syntax, but neither results in a successful join.
What is wrong with my Stream Analytics query?
Try adding a CAST function in the join. I'm not sure why that works while adding a CREATE TABLE clause for the VanList reference data input doesn't accomplish the same thing (possibly because the CSV reference data is read as strings while the JSON DeviceId arrives as a number, so the types need to be aligned explicitly), but I think this works.
SELECT
[VanList].deviceId
,[VanList].[VanName]
,events.[timestamp]
,events.externaltemp
,events.internaltemp
,events.humidity
,events.latitude
,events.longitude
INTO
[iot-powerBI]
FROM
[iot-EventHub] as events timestamp by [Timestamp]
join [VanList] on events.DeviceId = cast([VanList].deviceId as bigint)
The only thing I can see is that you are missing a comma in your original query; otherwise it looks correct. I would try recreating the Stream Analytics job. Here is another example that worked for me.
SELECT
countryref.CountryName as Geography,
input.GeographyId as GeographyId
into [country-out]
FROM input timestamp by [TransactionDateTime]
Join countryref
on countryref.GeographyID = input.GeographyId
Input data example
{"pageid":801,"firstname":"Gertrude","geographyid":2,"itemid":2,"itemprice":79.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"2ggnC"}
{"pageid":801,"firstname":"Venice","geographyid":1,"itemid":10,"itemprice":169.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"xLyOp"}
{"pageid":801,"firstname":"Christinia","geographyid":2,"itemid":2,"itemprice":79.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"VuycQ"}
{"pageid":801,"firstname":"Dorethea","geographyid":4,"itemid":2,"itemprice":79.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"tgvQP"}
{"pageid":801,"firstname":"Dwain","geographyid":4,"itemid":4,"itemprice":129.0,"transactiondatetime":"2015-06-30T14:25:51.0000000","creditcardnumber":"O5TwV"}
Country ref data
[
{
"GeographyID":1,
"CountryName":"USA"
},
{
"GeographyID":2,
"CountryName":"China"
},
{
"GeographyID":3,
"CountryName":"Brazil"
},
{
"GeographyID":4,
"CountryName":"Andrews country"
},
{
"GeographyID":5,
"CountryName":"Chile"
}
]
