Extract properties from multiple JSON arrays using Jolt transformation

My JSON object looks like the following:
{
  "array1": [
    {
      "key1": "value1", // common key
      "key2": "value2",
      "key3": "value3"
    },
    {
      "key1": "value1", // common key
      "key2": "value2",
      "key3": "value3"
    }
  ],
  "includes": {
    "array2": [
      {
        "key1": "value1", // common key
        "key4": "value4",
        "key5": "value5"
      },
      {
        "key1": "value1",
        "key4": "value4",
        "key5": "value5"
      }
    ]
  }
}
I need the output in the following format:
[
  {
    "key1": "value1",
    "key2": "value2",
    "key3": "value3",
    "key4": "value4", // this comes from joining with array2 based on key1
    "key5": "value5" // this comes from joining with array2 based on key1
  },
  {
    "key1": "value1",
    "key2": "value2",
    "key3": "value3",
    "key4": "value4", // this comes from joining with array2 based on key1
    "key5": "value5" // this comes from joining with array2 based on key1
  }
]
I only have a solution that fetches fields from array1, but I'm unsure how to join with array2 on the common key, fetch the required fields, and represent them in the desired way.
Current transformation:
[
  {
    "operation": "shift",
    "spec": {
      "data": {
        "*": {
          "key1": "[&1].key1",
          "key2": "[&1].key2",
          "key3": "[&1].key3"
        }
      }
    }
  }
]
Current undesired output:
[
  {
    "key1" : "value1",
    "key2" : "value2",
    "key3" : "value3"
  },
  {
    "key1" : "value1",
    "key2" : "value2",
    "key3" : "value3"
  }
]
Any help would be appreciated here. Thank you!

First of all, in order to get even "the undesired output", you need to replace "data" with the "*" wildcard in your current spec, since the input has no "data" key. There is also no need to repeat each attribute's key name and value branch; this spec alone is enough:
[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "*": {
          "*": "[&1].&"
        }
      }
    }
  }
]
If you nest one more level, such as
[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "*": {
          "*": {
            "*": "[&1].&"
          }
        }
      }
    }
  }
]
then, you'd get
[
  {
    "key1" : "value1",
    "key4" : "value4",
    "key5" : "value5"
  },
  {
    "key1" : "value1",
    "key4" : "value4",
    "key5" : "value5"
  }
]
We can use the "*" and "#" wildcards at different levels of the objects in order to combine those results, but in this case the values of "key1" would of course repeat. We can get rid of that repetition by adding a cardinality transformation, which yields your desired result:
[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "*": {
          "*": {
            "*": "[&1].&"
          },
          "#": "[&1]"
        }
      }
    }
  },
  {
    "operation": "cardinality",
    "spec": {
      "*": {
        "*": "ONE"
      }
    }
  }
]
You can test this transformation out on the demo site at http://jolt-demo.appspot.com/.
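For reference (this is not part of the Jolt answer, just an illustrative sketch), the same join can be expressed in plain Python, which makes the intended semantics explicit: build a lookup of array2 entries by key1, then overlay the matching fields onto each array1 entry.

import json

doc = {
    "array1": [
        {"key1": "value1", "key2": "value2", "key3": "value3"}
    ],
    "includes": {
        "array2": [
            {"key1": "value1", "key4": "value4", "key5": "value5"}
        ]
    }
}

# Index array2 entries by the join key for O(1) lookup.
by_key1 = {item["key1"]: item for item in doc["includes"]["array2"]}

# Overlay the matching array2 fields onto each array1 entry.
merged = [{**entry, **by_key1.get(entry["key1"], {})} for entry in doc["array1"]]

print(json.dumps(merged, indent=2))
# [{"key1": "value1", "key2": "value2", "key3": "value3",
#   "key4": "value4", "key5": "value5"}]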

Related

Using Jolt Spec, how to reverse reduce a list of dictionaries by a key

Using the following code, I was able to map a list of dictionaries by a key:
import json
values_list = [{"id" : 1, "user":"Rick", "title":"More JQ"}, {"id" : 2, "user":"Steve", "title":"Beyond"}, {"id" : 1, "user":"Rick", "title":"Winning"}]
result = {}
for data in values_list:
    id = data['id']
    user = data['user']
    title = data['title']
    if id not in result:
        result[id] = {
            'id' : id,
            'user' : user,
            'books' : {'titles' : []}
        }
    result[id]['books']['titles'].append(title)
print(json.dumps((list(result.values())), indent=4))
Knowing how clean Jolt specs are, and wanting to keep the schema separate from the code: is there a way to use a Jolt spec to achieve the same result?
The Result
[
  {
    "id": 1,
    "user": "Rick",
    "books": {
      "titles": [
        "More JQ",
        "Winning"
      ]
    }
  },
  {
    "id": 2,
    "user": "Steve",
    "books": {
      "titles": [
        "Beyond"
      ]
    }
  }
]
You can use three consecutive specs:
[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "*": "#(1,id).&",
        "title": "#(1,id).books.&s[]"
      }
    }
  },
  {
    "operation": "shift",
    "spec": {
      "*": ""
    }
  },
  {
    "operation": "cardinality",
    "spec": {
      "*": {
        "id": "ONE",
        "user": "ONE"
      }
    }
  }
]
In the first spec, the common id values are grouped by the "#(1,id)" expression.
In the second spec, the integer keys (1, 2) of the outermost objects are removed.
In the last spec, only the first of the repeating elements is picked.
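For illustration (traced by hand from the spec, so treat it as a sketch rather than verified output), the intermediate result after the first shift groups everything under the id values:

{
  "1": {
    "id": [1, 1],
    "user": ["Rick", "Rick"],
    "books": {
      "titles": ["More JQ", "Winning"]
    }
  },
  "2": {
    "id": 2,
    "user": "Steve",
    "books": {
      "titles": ["Beyond"]
    }
  }
}

The second shift then lifts those two objects into a top-level array, and the cardinality step collapses the repeated "id" and "user" arrays down to single values.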

Jolt merge array values from objects in one array

I have the following JSON array:
[
  {
    "name" : [ "roger", "roger" ],
    "state" : [ "primary", "quality" ],
    "value" : [ 1, 2 ]
  },
  {
    "name" : [ "david", "david" ],
    "state" : [ "primary", "quality" ],
    "value" : [ 4, 5 ]
  }
]
and I want to produce the following JSON object using Jolt:
{
  "name" : [ "roger", "roger", "david", "david" ],
  "state" : [ "primary", "quality", "primary", "quality" ],
  "value" : [ 1, 2, 4, 5 ]
}
Can someone please help me?
You can apply the shift transformation twice, such as:
[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "*": "&.&1"
      }
    }
  },
  {
    "operation": "shift",
    "spec": {
      "*": {
        "0": {
          "*": "&2[]"
        },
        "1": {
          "*": "&2[]"
        }
      }
    }
  }
]
In the first step, we pin the keys (&) and their respective indexes (&1, which resolves to 0 and 1) by prepending ampersand references to the keys, as in "&.&1". Then we dissipate the respective values through "*": "&2[]", in which &2 represents going two levels up (traversing two curly braces) in order to reach the root key, so that each value is appended to the corresponding array.
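For reference (again traced by hand, so a sketch rather than verified output), the intermediate result after the first shift should look roughly like:

{
  "name": {
    "0": ["roger", "roger"],
    "1": ["david", "david"]
  },
  "state": {
    "0": ["primary", "quality"],
    "1": ["primary", "quality"]
  },
  "value": {
    "0": [1, 2],
    "1": [4, 5]
  }
}

The second shift then walks the "0" and "1" branches in order and appends their elements to a single array per key.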

How do I use Jolt to flatten a json array of n objects with the key?

I have a fairly straightforward use case, but I can't seem to wrap my head around the shift specification that would make this transpose possible. It's primarily just flattening the tree hierarchy into simple output arrays.
How would I turn this input JSON:
{
  "123": [
    {
      "VALUE_ONE": "Y",
      "VALUE_TWO": "12"
    },
    {
      "VALUE_ONE": "N",
      "VALUE_TWO": "2"
    }
  ],
  "456": [
    {
      "VALUE_ONE": "Y",
      "VALUE_TWO": "35"
    }
  ]
}
Into this output:
[
  {
    "value_one_new_name": "Y",
    "value_two_new_name": "12",
    "key": "123"
  },
  {
    "value_one_new_name": "N",
    "value_two_new_name": "2",
    "key": "123"
  },
  {
    "value_one_new_name": "Y",
    "value_two_new_name": "35",
    "key": "456"
  }
]
NOTE that I don't know what the key ("456", "123", etc.) would be for each object, so the Jolt spec needs to be generic enough to convert any keys; the only known field names are "VALUE_ONE" and "VALUE_TWO".
These steps will do the trick:
[
  {
    "operation": "shift",
    "spec": {
      "*": {
        "*": {
          "VALUE_ONE": "&2.[&1].value_one_new_name",
          "VALUE_TWO": "&2.[&1].value_two_new_name",
          "$1": "&2.[&1].key"
        }
      }
    }
  },
  {
    "operation": "shift",
    "spec": {
      "*": {
        "*": "[]"
      }
    }
  }
]
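A note on how this works, with a hand-traced (unverified) intermediate result: "$1" grabs the key one level up as a value, so after the first shift everything is still grouped under the original keys:

{
  "123": [
    {
      "value_one_new_name": "Y",
      "value_two_new_name": "12",
      "key": "123"
    },
    {
      "value_one_new_name": "N",
      "value_two_new_name": "2",
      "key": "123"
    }
  ],
  "456": [
    {
      "value_one_new_name": "Y",
      "value_two_new_name": "35",
      "key": "456"
    }
  ]
}

The second shift then discards the wrapper keys and collects every inner object into one flat array.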

Re-map an array of ObjectIds in each item of a Nested Array

I have a single document which has user-generated tags, and also entries, each of which has an array of tag IDs (or possibly none):
// Doc (with redacted items I would like to project too)
{
  "_id": ObjectId("5ae5afc93e1d0d2965a4f2d7"),
  "entries" : [
    {
      "_id" : ObjectId("5b159ebb0ed51064925dff24"),
      // Desired:
      // tags: [{
      //   "_id" : ObjectId("5b142ab7e419614016b8992d"),
      //   "name" : "Shit",
      //   "color" : "#95a5a6"
      // }]
      "tags" : [
        ObjectId("5b142ab7e419614016b8992d")
      ]
    }
  ],
  "tags" : [
    {
      "_id" : ObjectId("5b142608e419614016b89925"),
      "name" : "Outdated",
      "color" : "#3498db"
    },
    {
      "_id" : ObjectId("5b142ab7e419614016b8992d"),
      "name" : "Shit",
      "color" : "#95a5a6"
    }
  ]
}
How can I "fill up" the tag array for each entry with the corresponding value in the tags array? I tried $lookup and aggregate but it was too complicated to get right.
From the looks of your actual data, there is no need for populate() or $lookup here, since the data you want to "join" is not only in the same collection, it's actually in the same document. What you want here instead is $map, or even Array.map(), to simply take values in one array of the document and merge them into the other.
Aggregate $map transform
The basic case of what you need to do here is $map to transform each array in the output. These are the "entries", and within each "entry" we transform the "tags" by matching values to those within the "tags" array of the parent document:
Project.aggregate([
  { "$project": {
    "entries": {
      "$map": {
        "input": "$entries",
        "as": "e",
        "in": {
          "someField": "$$e.someField",
          "otherField": "$$e.otherField",
          "tags": {
            "$map": {
              "input": "$$e.tags",
              "as": "t",
              "in": {
                "$arrayElemAt": [
                  "$tags",
                  { "$indexOfArray": [ "$tags._id", "$$t" ] }
                ]
              }
            }
          }
        }
      }
    }
  }}
])
Note there the "someField" and "otherField" as placeholders for fields which "might" be present at that level within each "entry" document of the array. The only catch with $map is that what is specified within the "in" argument is the only output you actually get, so there is a need to explicitly name every single potential field that would be in your "variable keys" structure, and including the "tags".
The counter to this in modern releases since MongoDB 3.6 is to use $mergeObjects instead which allows a "merge" of the "re-mapped" inner array of "tags" into the "entry" document of each array member:
Project.aggregate([
  { "$project": {
    "entries": {
      "$map": {
        "input": "$entries",
        "as": "e",
        "in": {
          "$mergeObjects": [
            "$$e",
            { "tags": {
              "$map": {
                "input": "$$e.tags",
                "as": "t",
                "in": {
                  "$arrayElemAt": [
                    "$tags",
                    { "$indexOfArray": [ "$tags._id", "$$t" ] }
                  ]
                }
              }
            }}
          ]
        }
      }
    }
  }}
])
As for the actual $map on the "inner" array of "tags", here you can use the $indexOfArray operator to do a comparison with the "root level" field of "tags", based on where the _id property matches the value of the current entry of this "inner" array. With that "index" returned, the $arrayElemAt operator then "extracts" the actual array entry from that matched "index" position and replaces the current array entry in the $map with that element.
The only point of care here is the case where the two arrays in fact do not have matching entries for some reason. If you have already taken care of this, then the code here is fine. If there can be a mismatch, you might instead need $filter to match the elements and take the $arrayElemAt at index 0 instead:
"in": {
"$arrayElemAt": [
{ "$filter": {
"input": "$tags",
"cond": { "$eq": [ "$$this._id", "$$t" ] }
}},
0
]
}
The reason is that $filter allows a null result where there is no match, whereas $indexOfArray will return -1, and -1 used with $arrayElemAt returns the "last" array element. The "last" element is of course not the "matching" result in that scenario, since there was no match.
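To make the pitfall concrete, negative indexes in $arrayElemAt count from the end of the array, so a failed lookup silently yields the final element rather than an error:

// { "$arrayElemAt": [ ["a", "b", "c"], -1 ] }   returns "c"
// { "$arrayElemAt": [ ["a", "b", "c"], { "$indexOfArray": [ ["a"], "x" ] } ] }
//   -> $indexOfArray returns -1, so this also returns "c", not null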
Client side transformation
So, from the perspective where you are "only" returning the "entries" content "re-mapped" and discarding the "tags" from the root of the document, the aggregation approach where possible is the better option, since the server only returns the elements you actually want.
If you cannot do that, or otherwise really don't care whether the existing "tags" element is also returned, then an aggregation transformation is really not necessary here at all. In fact the "server" need not do anything, and probably "should not", considering all the data is already in the document and "additional" transforms just add to the document size.
So this is all actually possible to do with the result once it is returned to the client. For a simple transformation of the document, just the same as was demonstrated with the aggregation pipeline examples above, the only code you actually need is:
let results = await Project.find().lean();
results = results.map(({ entries, tags, ...r }) =>
  ({
    ...r,
    entries: entries.map(({ tags: etags, ...e }) =>
      ({
        ...e,
        tags: etags.map( tid => tags.find(t => t._id.equals(tid)) )
      })
    ),
    // tags
  })
);
This gives you exactly the same results, and you can even optionally keep the tags in there by un-commenting the // tags line. It's basically "exactly the same process" of using Array.map() on each array in order to do the transformation of each one.
The syntax to "merge" is much more simple with modern JavaScript object spread operations, and overall the language is far less terse. You use Array.find() in order to "lookup" the matching content of the two arrays for tags and the only other thing to be aware of is the ObjectId.equals() method, which is needed to actually compare these two values and built in to the returned types anyway.
Of course, since you are "transforming" the documents, you use .lean() on the mongoose operation returning the results, so the data returned is in fact plain JavaScript objects rather than Mongoose Document types bound to the schema, which is the default return.
Conclusion and Demonstration
The general lesson here is that if you are looking to "reduce data" in the returned response, then the aggregate() method is for you. If, however, you decide that you want the "whole" document data anyway and just want to "augment" these other array entries in the response, then just take the data back to the "client" and transform it there instead; ideally as "frontward" as possible, considering that "additions" just add weight to the payload response in this case.
A full demonstration listing would be:
const { Schema, Types: { ObjectId } } = mongoose = require('mongoose');
const uri = 'mongodb://localhost/test';

mongoose.Promise = global.Promise;
mongoose.set('debug', true);

const tagSchema = new Schema({
  name: String,
  color: String
});

const projectSchema = new Schema({
  entries: [],
  tags: [tagSchema]
});

const Project = mongoose.model('Project', projectSchema);

const log = data => console.log(JSON.stringify(data, undefined, 2));

// Sample data, including the "variable fields" someField/otherField;
// declared before the IIFE below so it is initialized when used
const data = [
  {
    "_id": ObjectId("5ae5afc93e1d0d2965a4f2d7"),
    "entries" : [
      {
        "_id" : ObjectId("5b159ebb0ed51064925dff24"),
        "someField": "someData",
        "tags" : [
          ObjectId("5b142ab7e419614016b8992d")
        ]
      }
    ],
    "tags" : [
      {
        "_id" : ObjectId("5b142608e419614016b89925"),
        "name" : "Outdated",
        "color" : "#3498db"
      },
      {
        "_id" : ObjectId("5b142ab7e419614016b8992d"),
        "name" : "Shitake",
        "color" : "#95a5a6"
      }
    ]
  },
  {
    "_id": ObjectId("5b1b1ad07325c4c541e8a972"),
    "entries" : [
      {
        "_id" : ObjectId("5b1b1b267325c4c541e8a973"),
        "otherField": "otherData",
        "tags" : [
          ObjectId("5b142608e419614016b89925"),
          ObjectId("5b142ab7e419614016b8992d")
        ]
      }
    ],
    "tags" : [
      {
        "_id" : ObjectId("5b142608e419614016b89925"),
        "name" : "Outdated",
        "color" : "#3498db"
      },
      {
        "_id" : ObjectId("5b142ab7e419614016b8992d"),
        "name" : "Shitake",
        "color" : "#95a5a6"
      }
    ]
  }
];

(async function() {
  try {
    const conn = await mongoose.connect(uri);
    let db = conn.connections[0].db;
    let { version } = await db.command({ buildInfo: 1 });
    version = parseFloat(version.match(new RegExp(/(?:(?!-).)*/))[0]);

    // Reset the collections and insert the sample data
    await Promise.all(Object.entries(conn.models).map(([k, m]) => m.remove()));
    await Project.insertMany(data);

    let pipeline = [
      { "$project": {
        "entries": {
          "$map": {
            "input": "$entries",
            "as": "e",
            "in": {
              "someField": "$$e.someField",
              "otherField": "$$e.otherField",
              "tags": {
                "$map": {
                  "input": "$$e.tags",
                  "as": "t",
                  "in": {
                    "$arrayElemAt": [
                      "$tags",
                      { "$indexOfArray": [ "$tags._id", "$$t" ] }
                    ]
                  }
                }
              }
            }
          }
        }
      }}
    ];

    // Rewrite the first pipeline into the $mergeObjects form
    let other = [
      {
        ...(({ $project: { entries: { $map: { input, as, ...o } } } }) =>
          ({
            $project: {
              entries: {
                $map: {
                  input,
                  as,
                  in: {
                    "$mergeObjects": [ "$$e", { tags: o.in.tags } ]
                  }
                }
              }
            }
          })
        )(pipeline[0])
      }
    ];

    let tests = [
      { name: 'Standard $project $map', pipeline },
      ...(version >= 3.6) ?
        [{ name: 'With $mergeObjects', pipeline: other }] : []
    ];

    for ( let { name, pipeline } of tests ) {
      let results = await Project.aggregate(pipeline);
      log({ name, results });
    }

    // Client Manipulation
    let results = await Project.find().lean();
    results = results.map(({ entries, tags, ...r }) =>
      ({
        ...r,
        entries: entries.map(({ tags: etags, ...e }) =>
          ({
            ...e,
            tags: etags.map( tid => tags.find(t => t._id.equals(tid)) )
          })
        )
      })
    );
    log({ name: 'Client re-map', results });

    mongoose.disconnect();
  } catch(e) {
    console.error(e)
  } finally {
    process.exit()
  }
})();
And this would give the full output (with the optional output from a supporting MongoDB 3.6 instance) as:
Mongoose: projects.remove({}, {})
Mongoose: projects.insertMany([ { entries: [ { _id: 5b159ebb0ed51064925dff24, someField: 'someData', tags: [ 5b142ab7e419614016b8992d ] } ], _id: 5ae5afc93e1d0d2965a4f2d7, tags: [ { _id: 5b142608e419614016b89925, name: 'Outdated', color: '#3498db' }, { _id: 5b142ab7e419614016b8992d, name: 'Shitake', color: '#95a5a6' } ], __v: 0 }, { entries: [ { _id: 5b1b1b267325c4c541e8a973, otherField: 'otherData', tags: [ 5b142608e419614016b89925, 5b142ab7e419614016b8992d ] } ], _id: 5b1b1ad07325c4c541e8a972, tags: [ { _id: 5b142608e419614016b89925, name: 'Outdated', color: '#3498db' }, { _id: 5b142ab7e419614016b8992d, name: 'Shitake', color: '#95a5a6' } ], __v: 0 } ], {})
Mongoose: projects.aggregate([ { '$project': { entries: { '$map': { input: '$entries', as: 'e', in: { someField: '$$e.someField', otherField: '$$e.otherField', tags: { '$map': { input: '$$e.tags', as: 't', in: { '$arrayElemAt': [ '$tags', { '$indexOfArray': [Array] } ] } } } } } } } } ], {})
{
  "name": "Standard $project $map",
  "results": [
    {
      "_id": "5ae5afc93e1d0d2965a4f2d7",
      "entries": [
        {
          "someField": "someData",
          "tags": [
            {
              "_id": "5b142ab7e419614016b8992d",
              "name": "Shitake",
              "color": "#95a5a6"
            }
          ]
        }
      ]
    },
    {
      "_id": "5b1b1ad07325c4c541e8a972",
      "entries": [
        {
          "otherField": "otherData",
          "tags": [
            {
              "_id": "5b142608e419614016b89925",
              "name": "Outdated",
              "color": "#3498db"
            },
            {
              "_id": "5b142ab7e419614016b8992d",
              "name": "Shitake",
              "color": "#95a5a6"
            }
          ]
        }
      ]
    }
  ]
}
Mongoose: projects.aggregate([ { '$project': { entries: { '$map': { input: '$entries', as: 'e', in: { '$mergeObjects': [ '$$e', { tags: { '$map': { input: '$$e.tags', as: 't', in: { '$arrayElemAt': [Array] } } } } ] } } } } } ], {})
{
  "name": "With $mergeObjects",
  "results": [
    {
      "_id": "5ae5afc93e1d0d2965a4f2d7",
      "entries": [
        {
          "_id": "5b159ebb0ed51064925dff24",
          "someField": "someData",
          "tags": [
            {
              "_id": "5b142ab7e419614016b8992d",
              "name": "Shitake",
              "color": "#95a5a6"
            }
          ]
        }
      ]
    },
    {
      "_id": "5b1b1ad07325c4c541e8a972",
      "entries": [
        {
          "_id": "5b1b1b267325c4c541e8a973",
          "otherField": "otherData",
          "tags": [
            {
              "_id": "5b142608e419614016b89925",
              "name": "Outdated",
              "color": "#3498db"
            },
            {
              "_id": "5b142ab7e419614016b8992d",
              "name": "Shitake",
              "color": "#95a5a6"
            }
          ]
        }
      ]
    }
  ]
}
Mongoose: projects.find({}, { fields: {} })
{
  "name": "Client re-map",
  "results": [
    {
      "_id": "5ae5afc93e1d0d2965a4f2d7",
      "__v": 0,
      "entries": [
        {
          "_id": "5b159ebb0ed51064925dff24",
          "someField": "someData",
          "tags": [
            {
              "_id": "5b142ab7e419614016b8992d",
              "name": "Shitake",
              "color": "#95a5a6"
            }
          ]
        }
      ]
    },
    {
      "_id": "5b1b1ad07325c4c541e8a972",
      "__v": 0,
      "entries": [
        {
          "_id": "5b1b1b267325c4c541e8a973",
          "otherField": "otherData",
          "tags": [
            {
              "_id": "5b142608e419614016b89925",
              "name": "Outdated",
              "color": "#3498db"
            },
            {
              "_id": "5b142ab7e419614016b8992d",
              "name": "Shitake",
              "color": "#95a5a6"
            }
          ]
        }
      ]
    }
  ]
}
Note this includes some additional data to demonstrate the projection of "variable fields".

Databricks get JSON without schema

What's the typical approach for getting JSON from a REST API using Databricks?
It returns a nested structure, which can change over time and doesn't have any schema:
{ "page": "1",
"total": "10",
"payload": [
{ "param1": "value1",
"param2": "value2"
},
{ "param2": "value2",
"param3": "value3"
}
]
}
I'm trying to put it into a DataFrame.
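One common approach (a sketch, not from the original thread; it assumes a Databricks notebook where spark and sc are predefined, and a hypothetical endpoint url) is to fetch the raw JSON string and let Spark infer the schema from it, since spark.read.json accepts an RDD of JSON strings:

import requests
from pyspark.sql.functions import explode

# Hypothetical endpoint; replace with the real API being called.
url = "https://example.com/api/items"
raw = requests.get(url).text

# Let Spark infer the current schema from the payload itself.
df = spark.read.json(sc.parallelize([raw]))

# Flatten the "payload" array into one row per element; fields missing
# from a given element (param1 vs param3) simply come back as null.
flat = df.select("page", "total", explode("payload").alias("item")) \
         .select("page", "total", "item.*")
flat.show()

Because the schema is inferred on each read, fields added to the API later just show up as new columns on the next run.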
