PySpark: How to create a nested JSON from spark data frame? - python-3.x

I am trying to create a nested JSON from my Spark DataFrame, which has data in the structure shown below. The code below only produces a flat JSON with simple key/value pairs. Could you please help?
df.coalesce(1).write.format('json').mode('overwrite').save(data_output_file + "createjson.json")
Update 1:
As per @MaxU's answer, I converted the Spark DataFrame to pandas and used groupby. It puts the last two fields in a nested array. How can I first put the category and its count in a nested array, and then put the subcategory and its count inside that array?
Sample text data:
Vendor_Name,count,Categories,Category_Count,Subcategory,Subcategory_Count
Vendor1,10,Category 1,4,Sub Category 1,1
Vendor1,10,Category 1,4,Sub Category 2,2
Vendor1,10,Category 1,4,Sub Category 3,3
Vendor1,10,Category 1,4,Sub Category 4,4
j = (data_pd.groupby(['vendor_name', 'vendor_Cnt', 'Category', 'Category_cnt'], as_index=False)
     .apply(lambda x: x[['Subcategory', 'subcategory_cnt']].to_dict('records'))
     .reset_index()
     .rename(columns={0: 'subcategories'})
     .to_json(orient='records'))
[{
    "vendor_name": "Vendor 1",
    "count": 10,
    "categories": [{
        "name": "Category 1",
        "count": 4,
        "subCategories": [{
                "name": "Sub Category 1",
                "count": 1
            },
            {
                "name": "Sub Category 2",
                "count": 1
            },
            {
                "name": "Sub Category 3",
                "count": 1
            },
            {
                "name": "Sub Category 4",
                "count": 1
            }
        ]
    }]
}]

You need to restructure the whole DataFrame for that.
"subCategories" is a struct type.
from pyspark.sql import functions as F
df.withColumn(
    "subCategories",
    F.struct(
        F.col("subCategories").alias("name"),
        F.col("subcategory_count").alias("count")
    )
)
Then use groupBy with F.collect_list to create the array.
At the end, your DataFrame must contain only one record to get the result you expect.

I think the easiest way to do this in Python/pandas is a series of nested generators built on groupby:
def split_df(df):
    for (vendor, count), df_vendor in df.groupby(["Vendor_Name", "count"]):
        yield {
            "vendor_name": vendor,
            "count": count,
            "categories": list(split_category(df_vendor)),
        }
def split_category(df_vendor):
    for (category, count), df_category in df_vendor.groupby(
        ["Categories", "Category_Count"]
    ):
        yield {
            "name": category,
            "count": count,
            "subCategories": list(split_subcategory(df_category)),
        }
def split_subcategory(df_category):
    # Iterate over the rows of this category only, not the whole DataFrame.
    for row in df_category.itertuples():
        yield {"name": row.Subcategory, "count": row.Subcategory_Count}
list(split_df(df))
[
    {
        "vendor_name": "Vendor1",
        "count": 10,
        "categories": [
            {
                "name": "Category 1",
                "count": 4,
                "subCategories": [
                    {"name": "Sub Category 1", "count": 1},
                    {"name": "Sub Category 2", "count": 2},
                    {"name": "Sub Category 3", "count": 3},
                    {"name": "Sub Category 4", "count": 4},
                ],
            }
        ],
    }
]
To export this to JSON, you'll need a way to serialize the np.int64 values, because the standard json module cannot encode them by default.
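One option for the np.int64 values, sketched here under the assumption that the counts are NumPy scalars (which expose an `.item()` method returning the equivalent built-in Python value), is to pass a `default` hook to `json.dumps`. The helper name `to_builtin` and the stand-in class `FakeInt64` are hypothetical; the stand-in only exists so the sketch runs without NumPy installed.

```python
import json


def to_builtin(obj):
    """Fallback for json.dumps: unwrap NumPy-style scalars via .item()."""
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


class FakeInt64:
    """Hypothetical stand-in for np.int64 so the sketch runs without NumPy."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return int(self._value)


nested = [{"vendor_name": "Vendor1", "count": FakeInt64(10)}]
encoded = json.dumps(nested, default=to_builtin)
print(encoded)
```

With the real pandas output, the corresponding call would be along the lines of `json.dumps(list(split_df(df)), default=to_builtin)`.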

Related

CosmosDb query to return arrays

I have the following data in my Collection
{
"id": "00000000-0000-0000-454c-4b74472b01d8",
"GroupId": 1,
"Location": "London",
"Status": "Ok"
},
{
"id": "d129adeb-d1bf-4a89-afe3-93e3f60589fb",
"GroupId": 1,
"Location": "Liverpool",
"Status": "Ok"
},
{
"id": "85ecf875-0e32-40b5-823a-a2545694f9b6",
"GroupId": 2,
"Location": "Manchester",
"Status": "Nok"
}
I need to build a query that returns all possible values per group, for filtering.
For example, for "GroupId": 1 I need a result like:
{
"Location": [
"London",
"Liverpool"
],
"Status": [
"Ok"
]
}
for "GroupId": 2 the response:
{
    "Location": [
        "Manchester"
    ],
    "Status": [
        "Nok"
    ]
}
Could you please help me build such a query? I don't even know if it is possible with CosmosDb.
So far I have tried something like this, but it doesn't work:
select
    (
        select VALUE c.Location
        FROM c
        WHERE c.GroupId = 1
        GROUP BY c.Location
    ) as Location,
    (
        select VALUE c.Status
        FROM c
        WHERE c.GroupId = 1
        GROUP BY c.Status
    ) as Status
from c
WHERE c.GroupId = 1
and this
select
    [
        (SELECT VALUE [c.Location] from c)
    ] as Location,
    [
        (SELECT VALUE [c.Status] from c)
    ] as Status
from c
where c.GroupId = 1
Please help or suggest how to solve that. Thank you in advance.
It's not possible to do this with the way your data is modeled.
With the ARRAY expression you can do this in a subquery for arrays within a single document, but not when the data spans documents, as is the case here.
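If changing the data model is not an option, one possible workaround is to fetch the documents of a group with a plain query such as `SELECT c.Location, c.Status FROM c WHERE c.GroupId = 1` and build the arrays on the client. A minimal sketch in Python, assuming the documents have already been loaded into a list of dicts (the function name `distinct_values` is hypothetical, and no Cosmos DB SDK call is shown):

```python
def distinct_values(docs, group_id, fields=("Location", "Status")):
    """Collect the distinct values of each field for one GroupId, in first-seen order."""
    result = {field: [] for field in fields}
    for doc in docs:
        if doc.get("GroupId") != group_id:
            continue
        for field in fields:
            value = doc.get(field)
            if value is not None and value not in result[field]:
                result[field].append(value)
    return result


docs = [
    {"id": "a", "GroupId": 1, "Location": "London", "Status": "Ok"},
    {"id": "b", "GroupId": 1, "Location": "Liverpool", "Status": "Ok"},
    {"id": "c", "GroupId": 2, "Location": "Manchester", "Status": "Nok"},
]

print(distinct_values(docs, 1))
print(distinct_values(docs, 2))
```

The list membership check keeps the sketch simple; for large groups a set per field would be the more efficient choice.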

python dictionary how can create (structured) unique dictionary list if the key contains list of values of other keys

I have the unstructured dictionary list below, in which some keys hold lists of values that belong to separate entries.
I am not sure if my question is strange, but this is the actual payload that we receive from the source, and it is not aligned with the respective entries.
[
{
"dsply_nm": [
"test test",
"test test",
"",
""
],
"start_dt": [
"2021-04-21T00:01:00-04:00",
"2021-04-21T00:01:00-04:00",
"2021-04-21T00:01:00-04:00",
"2021-04-21T00:01:00-04:00"
],
"exp_dt": [
"2022-04-21T00:01:00-04:00",
"2022-04-21T00:01:00-04:00",
"2022-04-21T00:01:00-04:00",
"2022-04-21T00:01:00-04:00"
],
"hrs_pwr": [
"14",
"12",
"13",
"15"
],
"make_nm": "test",
"model_nm": "test",
"my_yr": "1980"
}
]
The length of the lists is not known in advance; for some keys it may be more or less than 4.
#Expected:
I need to check whether the dictionary above is properly structured and, if it is not, return a proper list of dictionaries with one dictionary per item.
for eg:
def get_dict_list(items):
    # Pseudocode: is_structured() and get_associated_dict_items_mapped() are placeholders.
    if not is_structured(items):
        result = get_associated_dict_items_mapped(items)
        return result
    else:
        return items
#Final result
expected_dict_list=
[{"dsply_nm":"test test","start_dt":"2021-04-21T00:01:00-04:00","exp_dt":"2022-04-21T00:01:00-04:00","hrs_pwr":"14"},
{"dsply_nm":"test test","start_dt":"2021-04-21T00:01:00-04:00","exp_dt":"2022-04-21T00:01:00-04:00","hrs_pwr":"12","make_nm": "test","model_nm": "test","my_yr": "1980"},
{"dsply_nm":"","start_dt":"2021-04-21T00:01:00-04:00","exp_dt":"2022-04-21T00:01:00-04:00","hrs_pwr":"13"},
{"dsply_nm":"","start_dt":"2021-04-21T00:01:00-04:00","exp_dt":"2022-04-21T00:01:00-04:00","hrs_pwr":"15"}
]
In the payload above, the part below belongs to the second dictionary item and has to be mapped accordingly:
"make_nm": "test",
"model_nm": "test",
"my_yr": "1980"
}
Can anyone help on this?
Thanks
Since customer_details is a list:
dict(zip(customer_details[0], list(customer_details[0].values())))
this yields:
{'insured_details': ['asset', 'asset', 'asset'],
'id': ['213', '214', '233'],
'dept': ['account', 'sales', 'market'],
'salary': ['12', '13', '14']}
I think a couple of list comprehensions will get you going. If you would like me to unwind them into more traditional for loops, just let me know.
import json
def get_dict_list(item):
    first_value = list(item.values())[0]
    if not isinstance(first_value, list):
        return [item]
    return [{key: item[key][i] for key in item.keys()} for i in range(len(first_value))]
cutomer_details = [
    {
        "insured_details": "asset",
        "id": "xxx",
        "dept": "account",
        "salary": "12"
    },
    {
        "insured_details": ["asset", "asset", "asset"],
        "id": ["213", "214", "233"],
        "dept": ["account", "sales", "market"],
        "salary": ["12", "13", "14"]
    }
]
cutomer_details_cleaned = []
for detail in cutomer_details:
    cutomer_details_cleaned.extend(get_dict_list(detail))
print(json.dumps(cutomer_details_cleaned, indent=4))
That should give you:
[
    {
        "insured_details": "asset",
        "id": "xxx",
        "dept": "account",
        "salary": "12"
    },
    {
        "insured_details": "asset",
        "id": "213",
        "dept": "account",
        "salary": "12"
    },
    {
        "insured_details": "asset",
        "id": "214",
        "dept": "sales",
        "salary": "13"
    },
    {
        "insured_details": "asset",
        "id": "233",
        "dept": "market",
        "salary": "14"
    }
]
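The payload in the question mixes lists of unknown length with plain scalar values, which the list comprehension above does not cover. A possible variation is sketched below using `itertools.zip_longest`. It rests on two assumptions that are not confirmed by the question: shorter lists are padded with `None`, and scalar values are copied to every generated record. The function name `explode_record` is hypothetical.

```python
from itertools import zip_longest


def explode_record(item, fill=None):
    """Split a dict of parallel lists into one dict per list position."""
    list_keys = [key for key, value in item.items() if isinstance(value, list)]
    if not list_keys:
        return [item]  # already a flat record
    scalars = {key: value for key, value in item.items() if key not in list_keys}
    # zip_longest pads the shorter lists with `fill` instead of truncating.
    rows = zip_longest(*(item[key] for key in list_keys), fillvalue=fill)
    # Assumption: scalar values apply to every generated record.
    return [{**dict(zip(list_keys, row)), **scalars} for row in rows]


payload = {
    "dsply_nm": ["test test", ""],
    "hrs_pwr": ["14", "12", "13"],
    "make_nm": "test",
}
records = explode_record(payload)
print(records)
```

If the scalar values should instead be attached to one specific record only, the merge step in the last line would need a rule for choosing that record.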

How to find common struct for all documents in collection?

I have an array of documents that have more or less the same structure, but I need to find the fields that are present in all documents. Something like:
{
"name": "Jow",
"salary": 7000,
"age": 25,
"city": "Mumbai"
},
{
"name": "Mike",
"backname": "Brown",
"sex": "male",
"city": "Minks",
"age": 30
},
{
"name": "Piter",
"hobby": "footbol",
"age": 25,
"location": "USA"
},
{
"name": "Maria",
"age": 22,
"city": "Paris"
},
All docs have name and age. How to find them with ArangoDB?
You could do the following:
Retrieve the attribute names of each document
Get the intersection of those attributes
i.e.
LET attrs = (FOR item IN test RETURN ATTRIBUTES(item, true))
RETURN APPLY("INTERSECTION", attrs)
APPLY is necessary so each list of attributes in attrs can be passed as a separate parameter to INTERSECTION.
Documentation:
ATTRIBUTES: https://www.arangodb.com/docs/stable/aql/functions-document.html#attributes
INTERSECTION: https://www.arangodb.com/docs/stable/aql/functions-array.html#intersection
APPLY: https://www.arangodb.com/docs/stable/aql/functions-miscellaneous.html#apply
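The same retrieve-then-intersect idea can be sketched client-side in Python, for instance to sanity-check the AQL result. This assumes the documents are already available as dicts; `common_fields` is a hypothetical helper name.

```python
def common_fields(docs):
    """Return the sorted field names that are present in every document."""
    if not docs:
        return []
    shared = set(docs[0])
    for doc in docs[1:]:
        shared &= set(doc)  # keep only the keys this document also has
    return sorted(shared)


docs = [
    {"name": "Jow", "salary": 7000, "age": 25, "city": "Mumbai"},
    {"name": "Mike", "backname": "Brown", "sex": "male", "city": "Minks", "age": 30},
    {"name": "Piter", "hobby": "footbol", "age": 25, "location": "USA"},
    {"name": "Maria", "age": 22, "city": "Paris"},
]

print(common_fields(docs))  # ['age', 'name']
```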

Merge documents by fields

I have two types of docs: main docs and docs with additional info for them.
{
    "id": "371",
    "name": "Mike",
    "location": "Paris"
},
{
    "id": "371-1",
    "age": 20,
    "lastname": "Piterson"
}
I need to merge them by id, to get result doc. The result should look like:
{
    "id": "371",
    "name": "Mike",
    "location": "Paris",
    "age": 20,
    "lastname": "Piterson"
}
Using COLLECT / INTO, SPLIT(), and MERGE():
FOR doc IN collection
  COLLECT id = SPLIT(doc.id, '-')[0] INTO groups
  RETURN MERGE(MERGE(groups[*].doc), {id})
Result:
[
{
"id": "371",
"location": "Paris",
"name": "Mike",
"lastname": "Piterson",
"age": 20
}
]
This will:
Split each id attribute at any - and return the first part
Group the results into separate arrays (groups)
Merge #1: Merge all objects into one
Merge #2: Merge the id into the result
See REMOVE & INSERT or REPLACE for write operations.
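For readers who want to trace the logic outside the database, the group-then-merge steps can be sketched in plain Python. This is only an illustration of the same idea, assuming the documents are already loaded as dicts; `merge_by_id` is a hypothetical name.

```python
def merge_by_id(docs):
    """Group documents by the part of "id" before the first "-" and merge each group."""
    merged = {}
    for doc in docs:
        base_id = doc["id"].split("-")[0]
        # Later documents overwrite earlier ones on key clashes, like MERGE().
        merged.setdefault(base_id, {}).update(doc)
    # Set "id" last, mirroring the outer MERGE(..., {id}) in the query.
    return [{**fields, "id": base_id} for base_id, fields in merged.items()]


docs = [
    {"id": "371", "name": "Mike", "location": "Paris"},
    {"id": "371-1", "age": 20, "lastname": "Piterson"},
]

print(merge_by_id(docs))
```

Setting the id last matters: without it, the merged document would carry whichever id was merged last ("371-1" here).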

Aggregation in arangodb using AQL

I'm attempting a fairly basic task in arangodb, using the SUM() aggregate function.
Here is a working query which returns the right data (though not yet aggregated):
FOR m IN pkg_spp_RegMem
  FILTER m.memberId == "40289"
  COLLECT member = m.memberId INTO g
  RETURN { "memberId" : member, "amount" : g[*].m[*].items }
This returns the following results:
[
{
"memberId": "40289",
"amount": [
[
{
"amount": 50,
"description": "some description"
}
],
[
{
"amount": 50,
"description": "some description"
},
{
"amount": 500,
"description": "some description"
},
{
"amount": 0,
"description": "some description"
}
],
[
{
"amount": 0,
"description": "some description"
}
]
]
}
]
I am using COLLECT to group the results because a given memberId may have multiple 'RegMem' objects. As you can see from the query/results, each object has a list of smaller objects called 'items', with each item having an amount and a description.
I want to SUM() the amounts by member. However, adjusting the query like this does not work:
FOR m IN pkg_spp_RegMem
  FILTER m.memberId == "40289"
  COLLECT member = m.memberId INTO g
  RETURN { "memberId" : member, "amount" : SUM(g[*].m[*].items[*].amount) }
It returns 0 because it apparently can't find a field in the expanded items list called amount.
Looking at the results I can sort of understand why: the results are being returned such that items is actually a list, of lists of objects with amount/description. But I don't understand how to reference or expand the un-named list correctly to return the amount field values for the SUM() function.
Ideally the query should return the memberId and total amount, one row per member such that I can remove the filter and execute for all members.
Many thanks in advance if you can help!
Martin
PS I've worked through the AQL tutorial on the arangodb website and checked out the manual but what would really help me is loads more example queries to look through. If anyone knows of a resource like that or wants to share some of their own, 'much obliged. Cheers!
Edited: I misread the question the first time. The first answer can be seen in the edit history, as it also contains some hints:
I replicated your data by creating some documents in this format (and some with only one item):
{
"memberId": "40289",
"items": [
{
"amount": 50,
"description": "some description"
},
{
"amount": 500,
"description": "some description"
}
]
}
Based on some of those types of documents, your non-summarized query should indeed be looking like this:
FOR m IN pkg_spp_RegMem
  FILTER m.memberId == "40289"
  COLLECT member = m.memberId INTO g
  RETURN { "memberId" : member, "amount" : g[*].m[*].items }
The data returned:
[
{
"memberId": "40289",
"amount": [
[
{
"amount": 50,
"description": "some description"
},
{
"amount": 0,
"description": "some description"
}
],
[
{
"amount": 50,
"description": "some description"
},
{
"amount": 0,
"description": "some description"
}
],
[
{
"amount": 50,
"description": "some description"
}
],
[
{
"amount": 50,
"description": "some description"
},
{
"amount": 500,
"description": "some description"
}
],
[
{
"amount": 0,
"description": "some description"
}
],
[
{
"amount": 50,
"description": "some description"
},
{
"amount": 500,
"description": "some description"
}
]
]
}
]
Based on the non-summarized version, you need to loop through the items of the groups generated by COLLECT and do your SUM() there.
To SUM the items, you must first FLATTEN() them into a single list.
FOR m IN pkg_spp_RegMem
  FILTER m.memberId == "40289"
  COLLECT member = m.memberId INTO g
  RETURN { "memberId" : member, "amount" : SUM(
    FLATTEN(
      (
        FOR r in g[*].m[*].items
          RETURN r[*].amount
      )
    )
  )
  }
This results in:
[
{
"memberId": "40289",
"amount": 1250
}
]
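To see why the flattening step is needed, the same computation can be traced in Python on the list-of-lists shape returned by the non-summarized query. This is only an illustration of the flatten-then-sum idea, not part of the AQL solution; `total_amount` is a hypothetical name and the description fields are left out for brevity.

```python
from itertools import chain


def total_amount(item_lists):
    """Flatten a list of item lists and sum their "amount" values."""
    return sum(item["amount"] for item in chain.from_iterable(item_lists))


# Same amounts as in the non-summarized result above, one inner list per document.
item_lists = [
    [{"amount": 50}, {"amount": 0}],
    [{"amount": 50}, {"amount": 0}],
    [{"amount": 50}],
    [{"amount": 50}, {"amount": 500}],
    [{"amount": 0}],
    [{"amount": 50}, {"amount": 500}],
]

print(total_amount(item_lists))  # 1250
```

Summing the outer list directly would operate on lists rather than numbers, which is the situation the original SUM() call ran into.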

Resources