Stream Analytics - Processing JSON with no array name

IoT Edge v2 with the modbus module sends data to IoT Hub in the format of:
[
    {
        "DisplayName": "Voltage",
        "HwId": "",
        "Address": "400001",
        "Value": "200",
        "SourceTimestamp": "2019-01-03 23:40:24"
    },
    {
        "DisplayName": "Voltage",
        "HwId": "",
        "Address": "400002",
        "Value": "24503",
        "SourceTimestamp": "2019-01-03 23:40:24"
    },
    ...
]
I want to convert this array to rows using a Stream Analytics query with CROSS APPLY GetArrayElements(), but that function requires an array name, and this array obviously has none. Any suggestions?
https://learn.microsoft.com/en-us/stream-analytics-query/getarrayelements-azure-stream-analytics
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-parsing-json

Yes, it needs an array name. CROSS APPLY GetArrayElements() is used for nested arrays.
Example:
[
    {
        "source": "xda",
        "data": [
            {
                "masterTag": "UNIFY1",
                "speed": 180
            },
            {
                "masterTag": "UNIFY2",
                "speed": 180
            }
        ],
        "EventEnqueuedUtcTime": "2018-07-20T19:28:18.5230000Z"
    },
    {
        "source": "xda",
        "data": [
            {
                "masterTag": "UNIFY3",
                "speed": 214
            },
            {
                "masterTag": "UNIFY4",
                "speed": 180
            }
        ],
        "EventEnqueuedUtcTime": "2018-07-20T19:28:20.5550000Z"
    }
]
You could use the SQL below to convert it to rows:
SELECT
    jsoninput.source,
    arrayElement.ArrayValue.masterTag
INTO
    output
FROM jsoninput
CROSS APPLY GetArrayElements(jsoninput.data) AS arrayElement
However, the input data you provided is a pure array. If you want to convert this array to rows, just use:
SELECT jsoninput.* FROM jsoninput

You don't have to use GetArrayElements. Choosing JSON as the input event serialization format is enough: Stream Analytics reads each object of a top-level array as a separate record. The same applies to line- or whitespace-separated JSON objects; each object is read as one record.
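For example, with the modbus payload above arriving on an input aliased input, each array element is already one event, so a plain projection returns one row per element (a sketch; the input/output names are illustrative):
SELECT
    DisplayName,
    Address,
    Value,
    SourceTimestamp
INTO
    output
FROM input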

Related

Cosmos Db: How to query for the maximum value of a property in an array of arrays?

I'm not sure how to write queries in Cosmos DB, as I'm used to SQL. My question is how to get the maximum value of a property in an array of arrays. I've been trying subqueries so far, but apparently I don't understand very well how they work.
In a structure such as the one below, how do I query for the city with the largest population among all states using the Data Explorer in Azure:
{
    "id": 1,
    "states": [
        {
            "name": "New York",
            "cities": [
                {
                    "name": "New York",
                    "population": 8500000
                },
                {
                    "name": "Hempstead",
                    "population": 750000
                },
                {
                    "name": "Brookhaven",
                    "population": 500000
                }
            ]
        },
        {
            "name": "California",
            "cities": [
                {
                    "name": "Los Angeles",
                    "population": 4000000
                },
                {
                    "name": "San Diego",
                    "population": 1400000
                },
                {
                    "name": "San Jose",
                    "population": 1000000
                }
            ]
        }
    ]
}
This is currently not possible as far as I know.
It would look a bit like this:
SELECT TOP 1 state.name AS stateName, city.name AS cityName, city.population FROM c
JOIN state IN c.states
JOIN city IN state.cities
--ORDER BY city.population DESC <-- this does not work in this case
You could write a user defined function that will allow you to write the query you probably expect, similar to this: CosmosDB sort results by a value into an array
The result could look like:
SELECT c.id, udf.OnlyMaxPop(c.states) FROM c
function OnlyMaxPop(states) {
    // Sort states descending by the population of their remaining (max-population) city
    function compareStates(stateA, stateB) {
        return stateB.cities[0].population - stateA.cities[0].population;
    }
    var onlyWithOneCity = states.map(s => {
        var maxPop = Math.max.apply(Math, s.cities.map(o => o.population));
        return {
            name: s.name,
            // Keep only the city (or cities) that hit the state's maximum
            cities: s.cities.filter(x => x.population === maxPop)
        };
    });
    return onlyWithOneCity.sort(compareStates)[0];
}
You would probably need to adapt the function to your exact query needs, but I am not certain what your desired result would look like.
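For instance, with the sample document from the question, New York's top city (population 8,500,000) beats California's (4,000,000), so the UDF would return something like:
{
    "name": "New York",
    "cities": [
        {
            "name": "New York",
            "population": 8500000
        }
    ]
}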

Spark streaming Dynamic Schema Evolution from Kafka Eventhub on Microbatch

We are streaming data from the Kafka Eventhub. The records may have a nested structure. The schema is inferred dynamically from the data, and the Delta table is created with the schema of the first incoming batch of data.
Note: the data read from the Kafka topic will be a whole JSON string. Hence:
When we apply a schema and convert to a DataFrame, we lose the values of fields with mismatched datatypes, as well as newly added fields.
When we do spark.read.json, all field values are converted to String.
We encounter situations where the source data has schema changes. Some of the scenarios we faced:
The datatype changes at the parent level
The datatype changes at a nested level
There are duplicate keys differing only in case
New fields are added
Sample source data with the actual schema:
{
    "Id": "101",
    "Name": "John",
    "Department": {
        "Id": "Dept101",
        "Name": "Technology",
        "EmpId": "10001"
    },
    "Experience": 2,
    "Organization": [
        {
            "Id": "Org101",
            "Name": "Google"
        },
        {
            "Id": "Org102",
            "Name": "Microsoft"
        }
    ]
}
Sample source data illustrating the four points above:
{
    "Id": "102",
    "name": "Rambo",            --- Point 3
    "Department": {
        "Id": "Dept101",
        "Name": "Technology",
        "EmpId": 10001          --- Point 2
    },
    "Experience": "2",          --- Point 1
    "Organization": [
        {
            "Id": "Org101",
            "Name": "Google",
            "Experience": "2"   --- Point 4
        },
        {
            "Id": "Org102",
            "Name": "Microsoft",
            "Experience": "2"
        }
    ]
}
We need a solution to overcome the above issues. Though it is difficult to merge the new schema into the existing Delta table, we should at least be able to separate the records with schema changes without losing the original data.
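One direction, sketched here only as an illustration (assumptions: a foreachBatch sink, the Kafka value column holding the JSON string, a known delta_schema for the existing table, and hypothetical delta_path / quarantine_path locations), is to re-infer each micro-batch's schema and divert mismatching batches as raw JSON:
def route_batch(batch_df, batch_id):
    # The Kafka payload arrives as bytes; cast it back to the raw JSON string
    raw = batch_df.selectExpr("CAST(value AS STRING) AS json")
    # Re-infer this micro-batch's schema from the JSON itself
    parsed = spark.read.json(raw.rdd.map(lambda r: r.json))
    # Naive comparison; in practice compare field names and types recursively,
    # ignoring nullability and field order
    if parsed.schema == delta_schema:
        parsed.write.format("delta").mode("append").save(delta_path)
    else:
        # Schema drift detected: keep the original records aside, losing nothing
        raw.write.mode("append").text(quarantine_path)

stream_df.writeStream.foreachBatch(route_batch).start()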

Get Message object properties in stream analytics

This is a sample JSON input packet. I'm writing transformation queries to get the data and it is working fine.
[
    {
        "source": "xda",
        "data": [
            {
                "masterTag": "UNIFY",
                "speed": 180
            }
        ],
        "EventEnqueuedUtcTime": "2018-07-20T19:28:18.5230000Z"
    },
    {
        "source": "xda",
        "data": [
            {
                "masterTag": "UNIFY",
                "speed": 214
            }
        ],
        "EventEnqueuedUtcTime": "2018-07-20T19:28:20.5550000Z"
    }
]
However, a custom property named "proFilter" has been added to the message object when it is sent to IoT Hub. It is not inside the payload but on the message object itself. I can get this property using an Azure Function, but I'm not sure how to get it in a Stream Analytics transformation query. Is there any way I can get it?
Basic transformation query:
WITH data AS
(
    SELECT
        source,
        GetArrayElement(data, 0) AS data_packet
    FROM input
)
SELECT
    source,
    data_packet.masterTag
INTO
    output
FROM data
Include the following function in your SELECT statement:
GetMetadataPropertyValue(input, '[User].[proFilter]') AS proFilter
If you are interested in retrieving all your custom properties as a record, you can use
GetMetadataPropertyValue(input, '[User]') AS userprops
See this doc for further reference
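Folded into the basic transformation query above, this could look like the following (a sketch assembled from the query and the property name in the question):
WITH data AS
(
    SELECT
        source,
        GetArrayElement(data, 0) AS data_packet,
        GetMetadataPropertyValue(input, '[User].[proFilter]') AS proFilter
    FROM input
)
SELECT
    source,
    data_packet.masterTag,
    proFilter
INTO
    output
FROM data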

Can I get CosmosDB graph to return edge details for vertex objects in query results?

Consider the following simple gremlin query: g.V("some_id")
When executed against my CosmosDB graph database from the "Data Explorer" tab of the Azure web UI, I get the following results:
[
    {
        "id": "some_id",
        "label": "some_type",
        "type": "vertex",
        "outE": {
            "some_edge": [
                {
                    "id": "75b3c6ff-efdf-4a88-8cf6-aa395ef28bf7",
                    "inV": "another_id"
                },
                {
                    "id": "f3703292-12b9-44bc-a16f-26bac75f3420",
                    "inV": "yet_another_id"
                }
            ]
        },
        "properties": {
            "some_property": [
                {
                    "id": "50bda5cb-08ab-4727-b212-5ba4e829db3e|organizationId",
                    "value": "hi there"
                }
            ]
        }
    }
]
When I execute the same exact query against the same exact database using the gremlin websocket endpoint, I get the following results:
[
    {
        "id": "some_id",
        "label": "some_type",
        "type": "vertex",
        "properties": {
            "some_property": [
                {
                    "id": "50bda5cb-08ab-4727-b212-5ba4e829db3e|organizationId",
                    "value": "hi there"
                }
            ]
        }
    }
]
What happened to the edges (the "outE" JSON key)? Only the "properties" key is included, but man, I need those edges! How do I adjust the output format to include them?
This looks like an artifact of the way the Data Explorer shows and parses the data returned by the underlying engine. Since the edges are not properties of the vertices, I don't think they should be included as part of the vertex returned by the query. If you want to return the vertex and the associated edges, you can do that with a query like this, which works in the Gremlin console and via the driver:
g.V('some-id').as('b').bothE().as('e').select('b', 'e')

How to search through data with arbitrary amount of fields?

I have a web-form builder for science events. The event moderator creates a registration form with an arbitrary number of boolean, integer, enum and text fields.
The created form is used to:
register a new member to the event;
search through registered members.
What is the best search tool for the second task (searching the members of an event)? Is Elasticsearch well suited for this?
I wrote a post about how to index arbitrary data into Elasticsearch and then to search it by specific fields and values. All this, without blowing up your index mapping.
The post is here: http://smnh.me/indexing-and-searching-arbitrary-json-data-using-elasticsearch/
In short, you will need to do the following steps to get what you want:
Create a special index described in the post (a rough mapping sketch follows this list).
Flatten the data you want to index using the flattenData function:
https://gist.github.com/smnh/30f96028511e1440b7b02ea559858af4.
Create a document with the original and flattened data and index it into Elasticsearch:
{
    "data": { ... },
    "flatData": [ ... ]
}
Optional: use Elasticsearch aggregations to find which fields and types have been indexed.
Execute queries on the flatData object to find what you need.
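For orientation, the special index from step 1 needs data left unindexed and flatData mapped as nested key/value fields, roughly like the sketch below (an assumption reconstructed from the queries further down; the post contains the authoritative mapping):
{
    "mappings": {
        "properties": {
            "data": { "type": "object", "enabled": false },
            "flatData": {
                "type": "nested",
                "properties": {
                    "key": { "type": "keyword" },
                    "type": { "type": "keyword" },
                    "key_type": { "type": "keyword" },
                    "value_string": {
                        "type": "text",
                        "fields": { "keyword": { "type": "keyword" } }
                    },
                    "value_long": { "type": "long" }
                }
            }
        }
    }
}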
Example
Based on your original question, let's assume that the first event moderator created a form with the following fields to register members for the science event:
name string
age long
sex long - 0 for male, 1 for female
In addition to this data, the related event probably has some sort of id, let's call it eventId. So the final document could look like this:
{
    "eventId": "2T73ZT1R463DJNWE36IA8FEN",
    "name": "Bob",
    "age": 22,
    "sex": 0
}
Now, before we index this document, we will flatten it using the flattenData function:
flattenData(document);
This will produce the following array:
[
    {
        "key": "eventId",
        "type": "string",
        "key_type": "eventId.string",
        "value_string": "2T73ZT1R463DJNWE36IA8FEN"
    },
    {
        "key": "name",
        "type": "string",
        "key_type": "name.string",
        "value_string": "Bob"
    },
    {
        "key": "age",
        "type": "long",
        "key_type": "age.long",
        "value_long": 22
    },
    {
        "key": "sex",
        "type": "long",
        "key_type": "sex.long",
        "value_long": 0
    }
]
Then we wrap this data in a document, as shown before, and index it.
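For instance (the index name events and document ID 1 are illustrative):
PUT /events/_doc/1
{
    "data": {
        "eventId": "2T73ZT1R463DJNWE36IA8FEN",
        "name": "Bob",
        "age": 22,
        "sex": 0
    },
    "flatData": [
        { "key": "eventId", "type": "string", "key_type": "eventId.string", "value_string": "2T73ZT1R463DJNWE36IA8FEN" },
        { "key": "name", "type": "string", "key_type": "name.string", "value_string": "Bob" },
        { "key": "age", "type": "long", "key_type": "age.long", "value_long": 22 },
        { "key": "sex", "type": "long", "key_type": "sex.long", "value_long": 0 }
    ]
}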
Then the second event moderator creates another form that has a new field, a field with the same name and type, and a field with the same name but a different type:
name string
city string
sex string - "male" or "female"
This event moderator decided that instead of having 0 and 1 for male and female, his form will allow choosing between two strings - "male" and "female".
Let's try to flatten the data submitted by this form:
flattenData({
    "eventId": "F1BU9GGK5IX3ZWOLGCE3I5ML",
    "name": "Alice",
    "city": "New York",
    "sex": "female"
});
This will produce the following data:
[
    {
        "key": "eventId",
        "type": "string",
        "key_type": "eventId.string",
        "value_string": "F1BU9GGK5IX3ZWOLGCE3I5ML"
    },
    {
        "key": "name",
        "type": "string",
        "key_type": "name.string",
        "value_string": "Alice"
    },
    {
        "key": "city",
        "type": "string",
        "key_type": "city.string",
        "value_string": "New York"
    },
    {
        "key": "sex",
        "type": "string",
        "key_type": "sex.string",
        "value_string": "female"
    }
]
Then, after wrapping the flattened data in a document and indexing it into Elasticsearch we can execute complicated queries.
For example, to find members named "Bob" registered for the event with ID 2T73ZT1R463DJNWE36IA8FEN we can execute the following query:
{
    "query": {
        "bool": {
            "must": [
                {
                    "nested": {
                        "path": "flatData",
                        "query": {
                            "bool": {
                                "must": [
                                    { "term": { "flatData.key": "eventId" } },
                                    { "match": { "flatData.value_string.keyword": "2T73ZT1R463DJNWE36IA8FEN" } }
                                ]
                            }
                        }
                    }
                },
                {
                    "nested": {
                        "path": "flatData",
                        "query": {
                            "bool": {
                                "must": [
                                    { "term": { "flatData.key": "name" } },
                                    { "match": { "flatData.value_string": "bob" } }
                                ]
                            }
                        }
                    }
                }
            ]
        }
    }
}
Elasticsearch automatically detects the field content in order to index it correctly, even if the mapping hasn't been defined previously. So yes, Elasticsearch suits these cases well.
However, you may want to fine-tune this behavior, or the default mapping applied by Elasticsearch may not correspond to what you need: in that case, take a look at the default mapping or, for even further control, the dynamic templates feature.
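For instance, a minimal dynamic template (the template name and pattern here are illustrative) that maps every new string field as a keyword instead of analyzed text could look like:
{
    "mappings": {
        "dynamic_templates": [
            {
                "strings_as_keywords": {
                    "match_mapping_type": "string",
                    "mapping": { "type": "keyword" }
                }
            }
        ]
    }
}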
If you let your end users decide the keys you store things in, you'll have an ever-growing mapping and cluster state, which is problematic.
This case and a suggested solution are covered in this article on common problems with Elasticsearch.
Essentially, you want to have everything that can possibly be user-defined as a value. Using nested documents, you can have a key-field and differently mapped value fields to achieve pretty much the same.
