For example, I have one full set of nested JSON, and I need to update it with the latest values from another nested JSON.
Can anyone help me with this?
I want to implement this in PySpark.
The full-set JSON looks like this:
{
"email": "abctest#xxx.com",
"firstName": "name01",
"id": 6304,
"surname": "Optional",
"layer01": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"layer02": {
"key1": "value1",
"key2": "value2"
},
"layer03": [
{
"inner_key01": "inner value01"
},
{
"inner_key02": "inner_value02"
}
]
},
"surname": "Required only$uid"
}
The latest JSON looks like this:
{
"email": "test#xxx.com",
"firstName": "name01",
"surname": "Optional",
"id": 6304,
"layer01": {
"key1": "value1",
"key2": "value2",
"key3": "value3",
"key4": "value4",
"layer02": {
"key1": "value1_changedData",
"key2": "value2"
},
"layer03": [
{
"inner_key01": "inner value01"
},
{
"inner_key02": "inner_value02"
}
]
},
"surname": "Required only$uid"
}
In the above, for id=6304 we have received updates for the layer01.layer02.key1 and email fields.
So I need to update these values in the full JSON. Kindly help me with this.
You can load the 2 JSON files into Spark data frames and do a left join to get the updates from the latest JSON data:
from pyspark.sql import functions as F

full_json_df = spark.read.json(full_json_path, multiLine=True)
latest_json_df = spark.read.json(latest_json_path, multiLine=True)

# For every non-id column, prefer the latest value when a matching id exists.
updated_df = full_json_df.alias("full").join(
    latest_json_df.alias("latest"),
    F.col("full.id") == F.col("latest.id"),
    "left"
).select(
    F.col("full.id"),
    *[
        F.when(F.col("latest.id").isNotNull(), F.col(f"latest.{c}"))
         .otherwise(F.col(f"full.{c}"))
         .alias(c)
        for c in full_json_df.columns
        if c != 'id'
    ]
)

updated_df.show(truncate=False)
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
#|id |email |firstName|layer01 |surname |
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
#|6304|test#xxx.com|name01 |[value1, value2, value3, value4, [value1_changedData, value2], [[inner value01,], [, inner_value02]]]|Optional|
#+----+------------+---------+-----------------------------------------------------------------------------------------------------+--------+
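If the merged result then needs to land back on disk as JSON, the data frame can be written out directly; a one-line sketch (output_path is an assumption, not part of the question):

# Writes newline-delimited JSON, one document per row (hypothetical path).
updated_df.write.mode("overwrite").json(output_path)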
Update:
If the schema changes between the full and latest JSONs, you can load the 2 files into the same data frame (this way the schemas are merged) and then deduplicate per id:
from pyspark.sql import Window
from pyspark.sql import functions as F

merged_json_df = spark.read.json("/path/to/{full_json.json,latest_json.json}", multiLine=True)

# order priority: rows from the latest file first, then the full file
w = Window.partitionBy(F.col("id")).orderBy(
    F.when(F.input_file_name().like('%latest%'), 0).otherwise(1)
)

updated_df = merged_json_df.withColumn("rn", F.row_number().over(w)) \
    .filter("rn = 1") \
    .drop("rn")

updated_df.show(truncate=False)
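If the documents are small enough to merge on the driver instead of in Spark, a plain-Python recursive merge of the parsed dictionaries is another option. A minimal sketch, assuming the two documents sit in local files named full_json.json and latest_json.json (the file names and merge rules are assumptions, not part of the answer above):

import json

def deep_merge(full, latest):
    # Recursively overlay values from `latest` onto `full`: nested dicts
    # are merged key by key; anything else (including lists) is replaced.
    merged = dict(full)
    for key, value in latest.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

with open("full_json.json") as f, open("latest_json.json") as g:
    updated = deep_merge(json.load(f), json.load(g))

print(json.dumps(updated, indent=2))

On the sample documents above this picks up both the email change and layer01.layer02.key1 while leaving everything else intact.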
Let's suppose I have these two documents:
{
"type": "ip",
"_id": "321",
"key1": "10",
"key2": "20",
"ip_config": {
"ip": "127.0.0.1",
"connexion": "WIFI"
}
}
{
"type": "device",
"_id": "1",
"key1": "10",
"key2": "20",
"device": {
"port": "8808",
"bits": 46
}
}
I want to generate a view in CouchDB that gives me the following output:
{
"key1": "10",
"key2": "20",
"ip_config": {
"port": "8808",
"bits": 46
},
"device": {
"port": "8808",
"bits": 46
}
}
What is the map function that can help me get this output?
As @RamblinRose points out, you cannot "join" documents with a view. The only thing you can do is emit the keys that are common between the docs (in this case it looks like key1 and key2 identify this relationship).
So if you had a database called devices and created a design document called test with a view called device-view with a map function:
function (doc) {
    emit([doc.key1, doc.key2], null);
}
Then you would be able to obtain all the documents related to the combination of key1 and key2 with:
https://host/devices/_design/test/_view/device-view?include_docs=true&key=[%2210%22,%2220%22]
This would give you:
{"total_rows":2,"offset":0,"rows":[
{"id":"1","key":["10","20"],"value":null,"doc":{"_id":"1","_rev":"1-630408a91350426758c0932ea109f4d5","type":"device","key1":"10","key2":"20","device":{"port":"8808","bits":46}}},
{"id":"321","key":["10","20"],"value":null,"doc":{"_id":"321","_rev":"1-09d9a676c37f17c04a2475492995fade","type":"ip","key1":"10","key2":"20","ip_config":{"ip":"127.0.0.1","connexion":"WIFI"}}}
]}
This doesn't do the join, so you would have to process them to obtain a single document.
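For example, that post-processing could happen client-side after querying the view; a minimal Python sketch using the requests library (the host, database, and the last-writer-wins merge strategy are assumptions):

import json
import requests

# Fetch all docs sharing the (key1, key2) combination, with full bodies.
resp = requests.get(
    "https://host/devices/_design/test/_view/device-view",
    params={"include_docs": "true", "key": json.dumps(["10", "20"])},
)

# Overlay the documents into one dict; CouchDB metadata (_id, _rev) and
# the type discriminator are dropped, and later rows win on shared keys.
combined = {}
for row in resp.json()["rows"]:
    combined.update({k: v for k, v in row["doc"].items()
                     if k not in ("_id", "_rev", "type")})

print(json.dumps(combined, indent=2))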
I'm constructing a dataframe from a JSON file and saving this dataframe to a parquet file. This parquet file is consumed by a Pig script for further processing.
Below is the schema of the JSON file:
{id:"1",
name:"test",
"fields": [
{
"fieldId": "ABC1.0",
"values": [
{
"key": "812320",
"formId": 11100,
"occ": 1,
"attachId": 0
}
]
},
{
"fieldId": "CDE2.0",
"values": [
{
"key": "MA",
"formId": 11100,
"occ": 1,
"attachId": 0
},
{
"key": 23.0,
"formId": 11100,
"occ": 1,
"attachId": 0
}
]
}
]
}
I need to set the data type of the "key" field based on its value. The value of the key could be a string, long, double, or integer.
How can I achieve this using a Spark DataFrame/Dataset?
First of all, the content of your JSON file is invalid: all property names need to be enclosed in double quotes.
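That is, the beginning of the file (the rest already uses quoted names) would have to read something like:

{
  "id": "1",
  "name": "test",
  "fields": [
    ...
  ]
}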
I have the below JSON file, where I need to filter the City data based on the flag value being equal to true:
"attributes": { "State":
[ { "type": "sc", "ov": true, "value": "TN" } ],
"City": [ { "type": "c", "flag": true, "value": "Chennai" },
{ "type": "c", "flag": false, "value": "Coimbatore" } ],
}
I'm expecting the output as below:
State: TN
City: Chennai
You can write something like the below to keep only the City entries whose flag is True:
import json

with open("/home/imtiaz/tmp/data1.json") as f:
    data = json.load(f)

data1 = [city for city in data['attributes']['City'] if city['flag'] is True]
data['attributes']['City'] = data1
Just load the JSON file into memory, then use a list comprehension to filter for entries where the flag is true.
import json
with open('yourfile.json', 'r') as citydata:
    cities_dict = json.load(citydata)

true_cities = [city for city in cities_dict['attributes']['City'] if city['flag']]
This won't mutate the original data; it returns a separate list of cities where the flag is true. To mutate the original data in memory instead, assign the list comprehension's result back to the original key, such as:
cities_dict['attributes']['City'] = [city for city in cities_dict['attributes']['City'] if city['flag']]
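If the filtered result then needs to be persisted, a minimal sketch (the output file name filtered.json is an assumption):

import json

with open('yourfile.json', 'r') as citydata:
    cities_dict = json.load(citydata)

# Keep only the City entries whose flag is true, then write back out.
cities_dict['attributes']['City'] = [
    city for city in cities_dict['attributes']['City'] if city['flag']
]

with open('filtered.json', 'w') as out:
    json.dump(cities_dict, out, indent=2)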
I have a JSON message coming from IoT Hub like:
{
"deviceId": "abc",
"topic": "data",
"data": {
"varname1": [{
"t": "timestamp1",
"v": "value1",
"f": "respondFrame1"
},
{
"t": "timestamp2",
"v": "value2",
"f": "respondFrame2"
}],
"varname2": [{
"t": "timestamp1",
"v": "value1",
"f": "respondFrame1"
},
{
"t": "timestamp2",
"v": "value2",
"f": "respondFrame2"
}]
}
}
and want to store this via an Azure Stream Analytics job into a Transact-SQL table like this:
ID | deviceId | varname | timestamp | respondFrame | value
-----+------------+-----------+-------------+----------------+--------
1 | abc | varname1 | timestamp1 | respondFrame1 | value1
2 | abc | varname1 | timestamp2 | respondFrame2 | value2
3 | abc | varname2 | timestamp1 | respondFrame1 | value1
4 | abc | varname2 | timestamp2 | respondFrame2 | value2
Does anybody know how to handle these stacked iterations and combine them (cross apply)?
Something like this pseudocode:
deviceId = msg.deviceId
for d in msg.data:
    for key in d:
        varname = key.name
        timestamp = key[varname].t
        frame = key[varname].f
        value = key[varname].v
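In runnable Python, the same intent would look roughly like this (a local sketch only, not Stream Analytics code; raw holds a trimmed version of the message above):

import json

raw = '''{"deviceId": "abc", "topic": "data", "data": {
    "varname1": [{"t": "timestamp1", "v": "value1", "f": "respondFrame1"},
                 {"t": "timestamp2", "v": "value2", "f": "respondFrame2"}]}}'''
msg = json.loads(raw)

# Flatten the nested structure into one row per (varname, sample) pair.
rows = [
    (msg["deviceId"], varname, sample["t"], sample["f"], sample["v"])
    for varname, samples in msg["data"].items()
    for sample in samples
]
print(rows)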
UPDATE regarding JS (Azure Stream Analytics)'s answer:
With the code
WITH datalist AS
(
    SELECT
        iotHubAlias.deviceId,
        data.PropertyName AS varname,
        data.PropertyValue AS arrayData
    FROM [iotHub] AS iotHubAlias
    CROSS APPLY GetRecordProperties(iotHubAlias.data) AS data
    WHERE iotHubAlias.topic = 'data'
)
SELECT
    datalist.deviceId,
    datalist.varname,
    arrayElement.ArrayValue.t AS [timestamp],
    arrayElement.ArrayValue.f AS respondFrame,
    arrayElement.ArrayValue.v AS value
INTO [temporary]
FROM datalist
CROSS APPLY GetArrayElements(datalist.arrayData) AS arrayElement
I always get an error:
{
"channels": "Operation",
"correlationId": "f9d4437b-707e-4892-a37b-8ad721eb1bb2",
"description": "",
"eventDataId": "ef5a5f2b-8c2f-49c2-91f0-16213aaa959d",
"eventName": {
"value": "streamingNode0",
"localizedValue": "streamingNode0"
},
"category": {
"value": "Administrative",
"localizedValue": "Administrative"
},
"eventTimestamp": "2018-08-21T18:23:39.1804989Z",
"id": "/subscriptions/46cd2f8f-b46b-4428-8f7b-c7d942ff745d/resourceGroups/fieldtest/providers/Microsoft.StreamAnalytics/streamingjobs/streamAnalytics4fieldtest/events/ef5a5f2b-8c2f-49c2-91f0-16213aaa959d/ticks/636704726191804989",
"level": "Error",
"operationId": "7a38a957-1a51-4da1-a679-eae1c7e3a65b",
"operationName": {
"value": "Process Events: Processing events Runtime Error",
"localizedValue": "Process Events: Processing events Runtime Error"
},
"resourceGroupName": "fieldtest",
"resourceProviderName": {
"value": "Microsoft.StreamAnalytics",
"localizedValue": "Microsoft.StreamAnalytics"
},
"resourceType": {
"value": "Microsoft.StreamAnalytics/streamingjobs",
"localizedValue": "Microsoft.StreamAnalytics/streamingjobs"
},
"resourceId": "/subscriptions/46cd2f8f-b46b-4428-8f7b-c7d942ff745d/resourceGroups/fieldtest/providers/Microsoft.StreamAnalytics/streamingjobs/streamAnalytics4fieldtest",
"status": {
"value": "Failed",
"localizedValue": "Failed"
},
"subStatus": {
"value": "",
"localizedValue": ""
},
"submissionTimestamp": "2018-08-21T18:24:34.0981187Z",
"subscriptionId": "46cd2f8f-b46b-4428-8f7b-c7d942ff745d",
"properties": {
"Message Time": "2018-08-21 18:23:39Z",
"Error": "- Unable to cast object of type 'Microsoft.EventProcessing.RuntimeTypes.ValueArray' to type 'Microsoft.EventProcessing.RuntimeTypes.IRecord'.\r\n",
"Message": "Runtime exception occurred while processing events, - Unable to cast object of type 'Microsoft.EventProcessing.RuntimeTypes.ValueArray' to type 'Microsoft.EventProcessing.RuntimeTypes.IRecord'.\r\n, : OutputSourceAlias:temporary;",
"Type": "SqlRuntimeError",
"Correlation ID": "f9d4437b-707e-4892-a37b-8ad721eb1bb2"
},
"relatedEvents": []
}
And here is an example of a real JSON message coming from a device:
{
"topic": "data",
"data": {
"ExternalFlowTemperatureSensor": [{
"t": "2018-08-22T11:00:11.955381",
"v": 16.64103,
"f": "Q6ES8KJIN1NX2DRGH36RX1WDT"
}],
"AdaStartsP2": [{
"t": "2018-08-22T11:00:12.863383",
"v": 382.363138,
"f": "9IY7B4DFBAMOLH3GNKRUNUQNUX"
},
{
"t": "2018-08-22T11:00:54.172501",
"v": 104.0,
"f": "IUJMP20CYQK60B"
}],
"s_DriftData[4].c32_ZeitLetzterTest": [{
"t": "2018-08-22T11:01:01.829568",
"v": 348.2916,
"f": "MMTPWQVLL02CA"
}]
},
"deviceId": "test_3c27db"
}
And (to have it complete) here is the creation code for the SQL table:
create table temporary (
    id int NOT NULL IDENTITY PRIMARY KEY,
    deviceId nvarchar(20) NOT NULL,
    timestamp datetime NOT NULL,
    varname nvarchar(100) NOT NULL,
    value float,
    respondFrame nvarchar(50)
)
The following query will give you the expected output:
WITH step1 AS
(
    SELECT
        event.deviceID,
        data.PropertyName AS varname,
        data.PropertyValue AS arrayData
    FROM blobtest AS event
    CROSS APPLY GetRecordProperties(event.data) AS data
)
SELECT
    event.deviceId,
    event.varname,
    arrayElement.ArrayValue.t AS [timestamp],
    arrayElement.ArrayValue.f AS frame,
    arrayElement.ArrayValue.v AS value
FROM step1 AS event
CROSS APPLY GetArrayElements(event.arrayData) AS arrayElement
You can find more info about JSON parsing on our documentation page "Parse JSON and Avro data in Azure Stream Analytics"
Let me know if you have any other questions.
JS (Azure Stream Analytics)
Consider these two sample documents stored in DocumentDB.
Document 1
"JobId": "04e63d1d-2af1-42af-a349-810f55817602",
"JobType": 3,
"
"Properties": {
"Key1": "Value1",
"Key2": "Value2"
}
"KeyNames": ["Key1", "Key2"]
Document 2
"JobId": "04e63d1d-2af1-42af-a349-810f55817603",
"JobType": 4,
"
"Properties": {
"Key3": "Value3",
"Key4": "Value4"
}
"KeyNames": ["Key3", "Key4"]
I want to select all the keys and all the values in the Properties object for each document.
Something like:
SELECT
    c.JobId,
    c.JobType,
    c.Properties.<Keys> AS Keys,
    c.Properties.<Values> AS Values
FROM c
But as you can see, the keys are not fixed. So how do I list them so that I finally get a result like this? I cannot use .NET or LINQ; I need a query that can be executed in the DocumentDB Query Explorer.
[
{
"JobId": "04e63d1d-2af1-42af-a349-810f55817602",
"JobType": 3,
"Key1": "Value1"
},
{
"JobId": "04e63d1d-2af1-42af-a349-810f55817602",
"JobType": 3,
"Key2": "Value2"
},
{
"JobId": "04e63d1d-2af1-42af-a349-810f55817603",
"JobType": 4,
"Key3": "Value3"
},
{
"JobId": "04e63d1d-2af1-42af-a349-810f55817603",
"JobType": 4,
"Key4": "Value4"
}
]
I was able to solve my problem using a UDF in DocumentDB. Since KeyNames is an array, the self-join was returning each key.
So this query:
SELECT
    c.JobId,
    c.JobType,
    Key,
    udf.GetValueUsingKey(c.Properties, Key) AS Value
FROM collection AS c
JOIN Key IN c.KeyNames
returned me the desired result.
You can define the UDF by using the Script Explorer provided in DocumentDB.
For my purpose I used:
function GetValueUsingKey(Properties, Key) {
    var result = Properties[Key];
    return JSON.stringify(result);
}
Hope this helps :)