Explode Array-Nested Array Spark SQL - apache-spark

I have a JSON structure with multiple nested arrays, like this:
{
  "id": 10,
  "packages": [
    {
      "packageId": 1010,
      "clusters": [
        {
          "fieldClusterId": 101010,
          "fieldDefinitions": [
            {
              "fieldId": 101011112999,
              "fieldName": "EntityId"
            }
          ]
        }
      ]
    }
  ]
}
I'm using Spark SQL to flatten the arrays into something like this:
id   packageId   fieldClusterId   fieldId        fieldName
10   1010        101010           101011112999   EntityId
The query ends up being a fairly ugly Spark SQL CTE with multiple steps:
%sql
with cte as (
  select
    id,
    explode(packages) as packages_exploded
  from temp),
cte2 as (
  select
    id,
    packages_exploded.packageId,
    explode(packages_exploded.clusters) as col
  from cte),
cte3 as (
  select
    id,
    packageId,
    col.fieldClusterId,
    explode(col.fieldDefinitions) as col
  from cte2)
select
  id,
  packageId,
  fieldClusterId,
  col.*
from cte3
Is there a nicer syntax to accomplish this multi-level explosion?

This is how I would implement it:
SELECT id,
       packageId,
       fieldClusterId,
       inline(fieldDefinitions)
FROM
  (SELECT id,
          packageId,
          inline(clusters)
   FROM
     (SELECT id,
             inline(packages)
      FROM TABLE_NAME))
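
If you prefer a single SELECT with no nesting, the same flattening can also be written by chaining generators with LATERAL VIEW, which Spark SQL supports as well. A minimal sketch against the question's temp table:

SELECT id,
       pkg.packageId,
       cl.fieldClusterId,
       fd.fieldId,
       fd.fieldName
FROM temp
LATERAL VIEW explode(packages) p AS pkg
LATERAL VIEW explode(pkg.clusters) c AS cl
LATERAL VIEW explode(cl.fieldDefinitions) f AS fd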

Related

how to query many-to-many relation with Typeorm Query Builder using subquery

I'm new to TypeORM. I have two simple entities, Course and User, plus a third table, CourseUserMapper, with course_id and user_id referencing the Course and User tables.
I've created a subquery selecting the user's name, email, and id:
const totalInstructors = this.dataSource
  .createQueryBuilder(User, 'u')
  .select([
    'u.name AS user_name',
    'u.email AS user_email',
    'cs.userId AS user_id',
    'cs.courseId AS courseId',
  ])
  .leftJoin(CourseInstructor, 'cs', 'cs.userId = u.id');
return await this.dataSource
  .createQueryBuilder(Course, 'course')
  .leftJoinAndMapMany(
    'course.users',
    '(' + totalInstructors.getQuery() + ')',
    'instructors',
    'instructors.courseId = course.id',
  )
  .where('course.id = :id', { id: course.id })
  .getMany();
but the above query doesn't return the users array.
Expected result:
[
  {
    "id": 46,
    "title": "course 1",
    "description": "this is test course",
    "users": [
      {
        "id": 9,
        "name": "John Doe",
        "email": "john@email.com"
      },
      {
        "id": 10,
        "name": "Anna",
        "email": "anna@email.com"
      }
    ]
  }
]
I tried the raw SQL, which does return user_name, user_email and user_id:
SELECT
  `course`.`id` AS `course_id`,
  `course`.`title` AS `course_title`,
  `course`.`description` AS `course_description`,
  `instructors`.*
FROM `course` `course`
LEFT JOIN `chapter` `chapter`
  ON `chapter`.`course_id` = `course`.`id`
LEFT JOIN (
  SELECT
    `u`.`name` AS user_name,
    `u`.`email` AS user_email,
    `cs`.`user_id` AS user_id,
    `cs`.`course_id` AS course_id
  FROM `users` `u`
  LEFT JOIN `course_users` `cs`
    ON `cs`.`user_id` = `u`.`id`
) `instructors`
  ON instructors.course_id = `course`.`id`
WHERE ( `course`.`id` = 40 )
Can anyone tell me what I'm doing wrong with TypeORM?
Thanks in advance!
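
For what it's worth, the backtick quoting suggests MySQL, and MySQL 8.0+ can build the nested users array directly in SQL with JSON_ARRAYAGG and JSON_OBJECT, sidestepping the entity-mapping step entirely. A minimal sketch, assuming the table and column names from the raw SQL above:

-- Sketch only: assumes MySQL 8.0+ and the course / users /
-- course_users tables from the raw SQL above.
SELECT
  course.id,
  course.title,
  course.description,
  JSON_ARRAYAGG(
    JSON_OBJECT('id', u.id, 'name', u.name, 'email', u.email)
  ) AS users
FROM course
LEFT JOIN course_users cs ON cs.course_id = course.id
LEFT JOIN users u ON u.id = cs.user_id
WHERE course.id = 46
GROUP BY course.id, course.title, course.description;

In TypeORM this could be executed with dataSource.query() and the users column parsed as JSON.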

select non unique record based on a unique combination of 2 fields cosmos db

I have many records in Cosmos DB container with the following structure (sample):
{
  "id": "aaaa",
  "itemCode": "1234",
  "itemDesc": "TEST",
  "otherfileds": ""
}
{
  "id": "bbbb",
  "itemCode": "1234",
  "itemDesc": "TEST2",
  "otherfileds": ""
}
{
  "id": "cccc",
  "itemCode": "5678",
  "itemDesc": "HELLO",
  "otherfileds": ""
}
{
  "id": "dddd",
  "itemCode": "5678",
  "itemDesc": "HELLO",
  "otherfileds": ""
}
{
  "id": "eeee",
  "itemCode": "9012",
  "itemDesc": "WORLD",
  "otherfileds": ""
}
{
  "id": "ffff",
  "itemCode": "9012",
  "itemDesc": "WORLD",
  "otherfileds": ""
}
Now I want to select the records where an item code has more than one distinct item description. Based on the sample records above, I would like to get back item code 1234, since it appears with different item descriptions:
{
  "id": "aaaa",
  "itemCode": "1234",
  "itemDesc": "TEST",
  "otherfileds": ""
}
{
  "id": "bbbb",
  "itemCode": "1234",
  "itemDesc": "TEST2",
  "otherfileds": ""
}
I have tried the query below, but realised it only returns duplicate entries that share both the same item code and the same description:
select count(1) from
  (select distinct value d.itemCode
   FROM
     (SELECT c.itemCode, c.itemDesc, COUNT(1) as dupcount
      FROM c
      where c.itemCode <> null
      GROUP BY c.itemCode, c.itemDesc) d
   where d.dupcount > 1)
But I need to find records where the same item code has different item descriptions (the query above only returns item codes with more than one occurrence of the same code/description pair, i.e. item codes 9012 and 5678).
EDIT
I think I managed to form the query to filter these results with 2 sub-queries (I think this could be improved though):
select distinct e.itemCode from
  (select d.itemCode, count(1) as dupcount
   FROM
     (SELECT c.itemCode, c.itemDesc
      FROM c
      where c.itemCode <> null
      GROUP BY c.itemCode, c.itemDesc) d
   group by d.itemCode) e
where e.dupcount > 1
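
Since the Cosmos DB SQL API can't join across documents, pulling back the full records is a second step: run the query above to collect the offending item codes, then feed them back in as an array parameter. A sketch, with @codes as an assumed parameter name supplied through the SDK:

-- @codes is an array parameter, e.g. ["1234"]
SELECT *
FROM c
WHERE ARRAY_CONTAINS(@codes, c.itemCode)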

Cosmos DB Query Array value using SQL

I have several JSON documents with the structure below in my Cosmos DB.
[
  {
    "USA": {
      "Applicable": "Yes",
      "Location": {
        "City": [
          "San Jose",
          "San Diego"
        ]
      }
    }
  }
]
I want to query all the results/documents where the City array contains the value "San Diego".
I've tried the SQL queries below:
SELECT DISTINCT *
FROM c["USA"]["Location"]
WHERE ["City"] IN ('San Diego')
SELECT DISTINCT *
FROM c["USA"]["Location"]
WHERE ["City"] = 'San Diego'
SELECT c
FROM c JOIN d IN c["USA"]["Location"]
WHERE d["City"] = 'San Diego'
I'm getting the results as 0 - 0
You need to query the entire document, filtering on whether your USA.Location.City array contains the value. For example:
SELECT *
FROM c
WHERE ARRAY_CONTAINS (c.USA.Location.City, "San Diego")
This will give you what you're trying to achieve.
Note: you have a slight anti-pattern in your schema. Using "USA" as a key means you can't easily query the country names. You should replace it with something like:
{
  "Country": "USA",
  "CountryMetadata": {
    "Applicable": "Yes",
    "Location": {
      "City": [
        "San Jose",
        "San Diego"
      ]
    }
  }
}
This lets you query all the different countries. And the query above would then need only a slight change:
SELECT *
FROM c
WHERE c.Country = "USA"
AND ARRAY_CONTAINS (c.CountryMetadata.Location.City, "San Diego")
Note that the query now works for any country, and you can pass in country value as a parameter (vs needing to hardcode the country name into the query because it's an actual key name).
Tl;dr don't put values as your keys.
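
For completeness, a hedged sketch of that parameterized form; @country and @city are placeholder parameter names whose values would be supplied through the SDK's query options:

SELECT *
FROM c
WHERE c.Country = @country
AND ARRAY_CONTAINS(c.CountryMetadata.Location.City, @city)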

Cosmos SQL where sessions[any].venues[any].id = this

Using this structure as an example, saved in a Cosmos database (course-database) in a collection (course-collection):
{
  "courseId": "courseId",
  "sessions": [
    {
      "sessionId": "sessionId1",
      "venues": [
        {
          "id": "venueId1"
        },
        {
          "id": "venueId2"
        }
      ]
    },
    {
      "sessionId": "sessionId2",
      "venues": [
        {
          "id": "venueId3"
        },
        {
          "id": "venueId4"
        }
      ]
    }
  ]
}
How do you write the SQL to count the total number of courses where at least one session has at least one venue whose id equals e.g. "venueId3"?
I've got this so far, but it only checks the first item of each list rather than any item:
SELECT COUNT(c.id) FROM c WHERE c.sessions[0].venues[0].id = "id"
The answer was join:
SELECT COUNT(c.id)
FROM c
JOIN s in c.sessions
JOIN v in s.venues
WHERE CONTAINS(v.id,"venueId3")
You would then add another JOIN for each level deeper you want to go in the JSON, e.g. if venues had an array of contacts:
SELECT COUNT(c.id)
FROM c
JOIN s in c.sessions
JOIN v in s.venues
JOIN co in v.contacts
WHERE CONTAINS(co.id,"contactId")
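
Two caveats worth hedging: CONTAINS does a substring match, so plain equality (v.id = "venueId3") is stricter; and because each JOIN fans out one row per array element, a course containing the venue in several sessions would be counted more than once. If either matters, an EXISTS subquery (which Cosmos SQL supports) counts each document once:

SELECT COUNT(1)
FROM c
WHERE EXISTS (
    SELECT VALUE v
    FROM s IN c.sessions
    JOIN v IN s.venues
    WHERE v.id = "venueId3"
)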

Azure Data Factory complex JSON source (nested arrays) to Azure Sql Database?

I have a JSON source document that will be uploaded to Azure blob storage regularly. The customer wants this input written to Azure SQL Database using Azure Data Factory. The JSON is, however, complex, with many nested arrays, and so far I have not been able to find a way to flatten the document. Perhaps this is not supported/possible?
[
  {
    "ActivityId": 1,
    "Header": {},
    "Body": [
      {
        "1stSubArray": [
          {
            "Id": 456,
            "2ndSubArray": [
              {
                "Id": "abc",
                "Descript": "text",
                "3rdSubArray": [
                  {
                    "Id": "def",
                    "morefields": "text"
                  },
                  {
                    "Id": "ghi",
                    "morefields": "sample"
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
]
I need to flatten it:
ActivityId, Id, Id, Descript, Id, morefields
1, 456, abc, text1, def, text
1, 456, abc, text2, ghi, sample
1, 456, xyz, text3, jkl, textother
1, 456, xyz, text4, mno, moretext
There could be 8+ flat records per ActivityId. Has anyone out there seen this and found a way to solve it using the Azure Data Factory Copy Data activity?
Azure SQL Database has some capable JSON-shredding abilities, including OPENJSON, which shreds JSON into rows, and JSON_VALUE, which returns scalar values from JSON. Since you already have Azure SQL DB in your architecture, it makes sense to use it rather than add additional components.
So why not adopt an ELT pattern, where you use Data Factory to insert the JSON into a table in Azure SQL DB and then call a stored procedure to shred it? Some sample SQL based on your example:
DECLARE @json NVARCHAR(MAX) = '[
  {
    "ActivityId": 1,
    "Header": {},
    "Body": [
      {
        "1stSubArray": [
          {
            "Id": 456,
            "2ndSubArray": [
              {
                "Id": "abc",
                "Descript": "text",
                "3rdSubArray": [
                  {
                    "Id": "def",
                    "morefields": "text"
                  },
                  {
                    "Id": "ghi",
                    "morefields": "sample"
                  }
                ]
              },
              {
                "Id": "xyz",
                "Descript": "text",
                "3rdSubArray": [
                  {
                    "Id": "jkl",
                    "morefields": "textother"
                  },
                  {
                    "Id": "mno",
                    "morefields": "moretext"
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
]'

--SELECT @json j

-- INSERT INTO yourTable ( ...
SELECT
    JSON_VALUE ( j.[value], '$.ActivityId' ) AS ActivityId,
    JSON_VALUE ( a1.[value], '$.Id' ) AS Id1,
    JSON_VALUE ( a2.[value], '$.Id' ) AS Id2,
    JSON_VALUE ( a2.[value], '$.Descript' ) AS Descript,
    JSON_VALUE ( a3.[value], '$.Id' ) AS Id3,
    JSON_VALUE ( a3.[value], '$.morefields' ) AS morefields
FROM OPENJSON( @json ) j
    CROSS APPLY OPENJSON ( j.[value], '$."Body"' ) AS m
    CROSS APPLY OPENJSON ( m.[value], '$."1stSubArray"' ) AS a1
    CROSS APPLY OPENJSON ( a1.[value], '$."2ndSubArray"' ) AS a2
    CROSS APPLY OPENJSON ( a2.[value], '$."3rdSubArray"' ) AS a3;
As you can see, I've used CROSS APPLY to navigate the multiple levels.
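
The same shredding can also be written with OPENJSON's explicit WITH clause, which gives typed, named columns instead of positional [value] references; AS JSON keeps each nested array as raw JSON for the next CROSS APPLY. A sketch against the same sample (the column types are assumptions):

SELECT j.ActivityId,
       a1.Id AS Id1,
       a2.Id AS Id2,
       a2.Descript,
       a3.Id AS Id3,
       a3.morefields
FROM OPENJSON( @json )
     WITH ( ActivityId INT           '$.ActivityId',
            Body       NVARCHAR(MAX) '$.Body' AS JSON ) j
CROSS APPLY OPENJSON ( j.Body )
     WITH ( [1stSubArray] NVARCHAR(MAX) '$."1stSubArray"' AS JSON ) m
CROSS APPLY OPENJSON ( m.[1stSubArray] )
     WITH ( Id            INT           '$.Id',
            [2ndSubArray] NVARCHAR(MAX) '$."2ndSubArray"' AS JSON ) a1
CROSS APPLY OPENJSON ( a1.[2ndSubArray] )
     WITH ( Id            NVARCHAR(20)  '$.Id',
            Descript      NVARCHAR(100) '$.Descript',
            [3rdSubArray] NVARCHAR(MAX) '$."3rdSubArray"' AS JSON ) a2
CROSS APPLY OPENJSON ( a2.[3rdSubArray] )
     WITH ( Id         NVARCHAR(20)  '$.Id',
            morefields NVARCHAR(100) '$.morefields' ) a3;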
In the past, you could follow this blog and my previous case (Losing data from Source to Sink in Copy Data) to set the Cross-apply nested JSON array option on the Blob Storage dataset. However, that option has since disappeared.
Instead, Collection Reference is now used for array-item schema mapping in the copy activity.
But based on my test, only one array can be flattened in a schema. Multiple arrays can be referenced and returned as one row containing all of their elements, but only one array can have each of its elements returned as an individual row. This is the current limitation of the jsonPath settings.
As a workaround, you can first convert the JSON file with nested objects into a CSV file using a Logic App, and then use the CSV file as input for Azure Data Factory. Please refer to this doc to understand how a Logic App can be used to convert nested objects in a JSON file to CSV. You could also make some effort on the SQL database side, such as the stored procedure mentioned in the comment by @GregGalloway.
To summarize: unfortunately, Collection Reference only works one level down in the array structure, which was not suitable for @Emrikol. In the end, @Emrikol abandoned Data Factory and built an app to do the work.
