PySpark Dataframe to Json - grouping data - apache-spark

We are trying to create a JSON document from a DataFrame. Please find the DataFrame below:
+----------+--------------------+----------+--------------------+-----------------+--------------------+---------------+--------------------+---------------+--------------------+--------------------+
| CustId| TIN|EntityType| EntityAttributes|AddressPreference| AddressDetails|EmailPreference| EmailDetails|PhonePreference| PhoneDetails| MemberDetails|
+----------+--------------------+----------+--------------------+-----------------+--------------------+---------------+--------------------+---------------+--------------------+--------------------+
|1234567890|XXXXXXXXXXXXXXXXXX...| Person|[{null, PRINCESS,...| Alternate|[{Home, 460 M XXX...| Primary|[{Home, HEREBY...| Alternate|[{Home, {88888888...|[{7777777, 999999...|
|1234567890|XXXXXXXXXXXXXXXXXX...| Person|[{null, PRINCESS,...| Alternate|[{Home, 460 M XXX...| Primary|[{Home, HEREBY...| Primary|[{Home, {88888888...|[{7777777, 999999...|
|1234567890|XXXXXXXXXXXXXXXXXX...| Person|[{null, PRINCESS,...| Primary|[{Home, PO BOX 695020...| Primary|[{Home, HEREBY...| Alternate|[{Home, {88888888...|[{7777777, 999999...|
|1234567890|XXXXXXXXXXXXXXXXXX...| Person|[{null, PRINCESS,...| Primary|[{Home, PO BOX 695020...| Primary|[{Home, HEREBY...| Primary|[{Home, {88888888...|[{7777777, 999999...|
+----------+--------------------+----------+--------------------+-----------------+--------------------+---------------+--------------------+---------------+--------------------+--------------------+
So the initial columns CustId, TIN, EntityType, EntityAttributes will be the same for a particular customer, say 1234567890 in our example, but that customer might have multiple addresses/phones/emails. Could you please help us with how to group them under one JSON document?
Expected Structure :
{
    "CustId": 1234567890,
    "TIN": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "EntityType": "Person",
    "EntityAttributes": [
        {
            "FirstName": "PRINCESS",
            "LastName": "XXXXXX",
            "BirthDate": "xxxx-xx-xx",
            "DeceasedFlag": "False"
        }
    ],
    "Address": [
        {
            "AddressPreference": "Alternate",
            "AddressDetails": {
                "AddressType": "Home",
                "Address1": "460",
                "City": "XXXX",
                "State": "XXX",
                "Zip": "XXXX"
            }
        },
        {
            "AddressPreference": "Primary",
            "AddressDetails": {
                "AddressType": "Home",
                "Address1": "PO BOX 695020",
                "City": "XXX",
                "State": "XXXX",
                "Zip": "695020"
            }
        }
    ],
    "Phone": [
        {
            "PhonePreference": "Primary",
            "PhoneDetails": {
                "PhoneType": "Home",
                "PhoneNumber": "xxxxx",
                "FormatPhoneNumber": "xxxxxx"
            }
        },
        {
            "PhonePreference": "Alternate",
            "PhoneDetails": {
                "PhoneType": "Home",
                "PhoneNumber": "xxxx",
                "FormatPhoneNumber": "xxxxx"
            }
        }
    ],
    "Email": [
        {
            "EmailPreference": "Primary",
            "EmailDetails": {
                "EmailType": "Home",
                "EmailAddress": "xxxxxxx@GMAIL.COM"
            }
        }
    ]
}
UPDATE
Tried the group-by method recommended below; it produces a single record per customer, but the email is repeated 4 times in the list, when ideally it should appear only once. Similarly, Alternate has 1 address and Primary has 1 address, yet the output shows 2 entries under each. Could you please help with an ideal solution?

Probably this should work. Here id plays the role of CustId in your example, i.e. the column with repeating values.
>>> df.show()
+----+------------+----------+
| id| address| email|
+----+------------+----------+
|1001| address-a| email-a|
|1001| address-b| email-b|
|1002|address-1002|email-1002|
|1003|address-1003|email-1002|
|1002| address-c| email-2|
+----+------------+----------+
Aggregate on those repeating columns and then convert to JSON:
>>> from pyspark.sql.functions import collect_list
>>> results = df.groupBy("id").agg(collect_list("address").alias("address"), collect_list("email").alias("email")).toJSON().collect()
>>> for i in results: print(i)
...
{"id":"1003","address":["address-1003"],"email":["email-1002"]}
{"id":"1002","address":["address-1002","address-c"],"email":["email-1002","email-2"]}
{"id":"1001","address":["address-a","address-b"],"email":["email-a","email-b"]}

Related

Syntax for JSONPath filtering to not return array

I'm new to JSONPath and want to write a JSONPath expression that retrieves a property value only if a certain condition is met. The value I'm after is not part of an array, but I've managed to make the filtering work in the following JSONPath tool: https://www.site24x7.com/tools/json-path-evaluator.html
Given the following JSON, I only want to extract the value of column2.dimValue if column2.attributeId equals B0:
{
"batchId": 279,
"companyId": "40",
"period": 202208,
"taxCode": "1",
"taxSystem": "",
"transactionDate": "2022-08-05T00:00:00.000",
"transactionNumber": 222006089,
"transactionType": "IF",
"year": 2022,
"accountingInformation": {
"account": "4010",
"column1": {
"attributeId": "H9",
"dimValue": "76"
},
"column2": {
"attributeId": "B0",
"dimValue": "2170103"
},
"column3": {
"attributeId": "",
"dimValue": ""
},
"column4": {
"attributeId": "BF",
"dimValue": "217010330"
},
"column5": {
"attributeId": "10",
"dimValue": "3101"
},
"column6": {
"attributeId": "06",
"dimValue": ""
},
"column7": {
"attributeId": "19",
"dimValue": "K"
}
},
"categories": {
"cat1": "H9",
"cat2": "B0",
"cat3": "",
"cat4": "BF",
"cat5": "10",
"cat6": "06",
"cat7": "19",
"dim1": "76",
"dim2": "2170103",
"dim3": "",
"dim4": "217010330",
"dim5": "3101",
"dim6": "",
"dim7": "K"
},
"amounts": {
"amount": 48.24,
"amount3": 0.0,
"amount4": 0.0,
"currencyAmount": 48.24,
"currencyCode": "NOK",
"debitCreditFlag": 1
},
"invoice": {
"customerOrSupplierId": "58118",
"description": "",
"externalArchiveReference": "",
"externalReference": "2170103",
"invoiceNumber": "220238522",
"ledgerType": "P"
},
"additionalInformation": {
"number": 0,
"orderLineNumber": 0,
"orderNumber": 0,
"sequenceNumber": 1,
"status": "",
"value": 0.0,
"valueDate": "2022-08-05T00:00:00.000"
},
"lastUpdated": {
"updatedAt": "2022-09-05T10:59:11.633",
"updatedBy": "HELVES"
}
}
I've used this JSONPath syntax:
$['accountingInformation']['column2'][?(@.attributeId=='B0')].dimValue
This gives the following result:
[
"2170103"
]
I'm using this result in an Azure Data Factory mapping, and it seems that it doesn't work because the result is an array.
Can anyone help me with syntax so that it returns only the actual value? Is that even possible?
I repro'd the same, and below is the approach.
A sample JSON file is taken as the source in a Lookup activity.
An If activity is used to check whether column2 has attributeId = 'B0'. The expression is given as below:
@equals(activity('Lookup1').output.value[0].accountingInformation.column2.attributeId, 'B0')
In the true case of the If activity, a Set Variable activity is added. A new variable of string type is set using the below expression:
@activity('Lookup1').output.value[0].accountingInformation.column2.dimValue
Then a Copy activity is added after the If activity. A dummy dataset is taken as the source, and +New is clicked under Additional columns:
Name: col1
Value: @variables('v2')
In Mapping, Import schemas is clicked, and all columns except the additional column added in the source are deleted.
The pipeline is debugged and the data is copied to the sink without error.
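For reference, the same check-then-extract logic can be sketched outside ADF in plain Python (field names are taken from the sample JSON above; the file name transaction.json is hypothetical):
import json

def dim_value_for_b0(doc):
    # Return column2.dimValue only when its attributeId is 'B0', otherwise None.
    column2 = doc["accountingInformation"]["column2"]
    return column2["dimValue"] if column2["attributeId"] == "B0" else None

with open("transaction.json") as f:  # hypothetical file holding the sample JSON
    doc = json.load(f)

print(dim_value_for_b0(doc))  # prints: 2170103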

Python dictionary: how to create a (structured) unique dictionary list when a key contains a list of values of other keys

I have the below unstructured dictionary list, in which each key contains a list of values belonging to other entries.
I am not sure if the question I ask is strange; this is the actual dictionary payload we receive from the source, and the values are not aligned with their respective entries.
[
{
"dsply_nm": [
"test test",
"test test",
"",
""
],
"start_dt": [
"2021-04-21T00:01:00-04:00",
"2021-04-21T00:01:00-04:00",
"2021-04-21T00:01:00-04:00",
"2021-04-21T00:01:00-04:00"
],
"exp_dt": [
"2022-04-21T00:01:00-04:00",
"2022-04-21T00:01:00-04:00",
"2022-04-21T00:01:00-04:00",
"2022-04-21T00:01:00-04:00"
],
"hrs_pwr": [
"14",
"12",
"13",
"15"
],
"make_nm": "test",
"model_nm": "test",
"my_yr": "1980"
}
]
"the length of list cannot not be expected and it could be more than 4 sometimes or less in some keys"
#Expected:
i need to check if the above dictionary are in proper structure or not and based on that it should return the proper dictionary list associate with each item
for eg:
def get_dict_list(items):
if type(items == not structure)
result = get_associated_dict_items_mapped
return result
else:
return items
#Final result
expected_dict_list = [
    {"dsply_nm": "test test", "start_dt": "2021-04-21T00:01:00-04:00", "exp_dt": "2022-04-21T00:01:00-04:00", "hrs_pwr": "14"},
    {"dsply_nm": "test test", "start_dt": "2021-04-21T00:01:00-04:00", "exp_dt": "2022-04-21T00:01:00-04:00", "hrs_pwr": "12", "make_nm": "test", "model_nm": "test", "my_yr": "1980"},
    {"dsply_nm": "", "start_dt": "2021-04-21T00:01:00-04:00", "exp_dt": "2022-04-21T00:01:00-04:00", "hrs_pwr": "13"},
    {"dsply_nm": "", "start_dt": "2021-04-21T00:01:00-04:00", "exp_dt": "2022-04-21T00:01:00-04:00", "hrs_pwr": "15"}
]
In the above dictionary payload, the part below is associated with the second dictionary's items and has to be mapped accordingly:
"make_nm": "test",
"model_nm": "test",
"my_yr": "1980"
Can anyone help with this?
Thanks
Since customer_details is a list:
dict(zip(customer_details[0], list(customer_details[0].values())))
this yields:
{'insured_details': ['asset', 'asset', 'asset'],
'id': ['213', '214', '233'],
'dept': ['account', 'sales', 'market'],
'salary': ['12', '13', '14']}
I think a couple of list comprehensions will get you going. If you would like me to unwind them into more traditional for loops, just let me know.
import json

def get_dict_list(item):
    first_value = list(item.values())[0]
    if not isinstance(first_value, list):
        return [item]
    return [{key: item[key][i] for key in item.keys()} for i in range(len(first_value))]

cutomer_details = [
    {
        "insured_details": "asset",
        "id": "xxx",
        "dept": "account",
        "salary": "12"
    },
    {
        "insured_details": ["asset", "asset", "asset"],
        "id": ["213", "214", "233"],
        "dept": ["account", "sales", "market"],
        "salary": ["12", "13", "14"]
    }
]

cutomer_details_cleaned = []
for detail in cutomer_details:
    cutomer_details_cleaned.extend(get_dict_list(detail))

print(json.dumps(cutomer_details_cleaned, indent=4))
That should give you:
[
    {
        "insured_details": "asset",
        "id": "xxx",
        "dept": "account",
        "salary": "12"
    },
    {
        "insured_details": "asset",
        "id": "213",
        "dept": "account",
        "salary": "12"
    },
    {
        "insured_details": "asset",
        "id": "214",
        "dept": "sales",
        "salary": "13"
    },
    {
        "insured_details": "asset",
        "id": "233",
        "dept": "market",
        "salary": "14"
    }
]

How to give my own _id while inserting data in Elasticsearch?

I have a sample database as below:
SNO   | Name | Address
99123 | Mike | Texas
88124 | Tom  | California
I want to use my SNO as the Elasticsearch _id to make it easier to update documents according to my SNO.
Python code to create an index:
abc = {
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 2
    }
}
es.indices.create(index='test', body=abc)
The request body sent from Postman is as below:
{
"_index": "test",
"_id": "13",
"_data": {
"FirstName": "Sample4",
"LastName": "ABCDEFG",
"Designation": "ABCDEF",
"Salary": "99",
"DateOfJoining": "2020-05-05",
"Address": "ABCDE",
"Gender": "ABCDE",
"Age": "21",
"MaritalStatus": "ABCDE",
"Interests": "ABCDEF",
"timestamp": "2020-05-05T14:42:46.394115",
"country": "Nepal"
}
}
And the insert code in Python is below:
req_JSON = request.json
input_index = req_JSON['_index']
input_id = req_JSON['_id']
input_data = req_JSON['_data']
doc = input_data
res = es.index(index=input_index, body=doc)
I thought the _id would remain the same as what I had given, but it generated an auto _id instead.
You can simply do it like this, passing the id explicitly:
res = es.index(index=input_index, body=doc, id=input_id)
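Applied to the SNO table from the question, a minimal sketch (assuming elasticsearch-py 7.x-style arguments and the 'test' index created earlier) would be:
from elasticsearch import Elasticsearch

es = Elasticsearch()

rows = [
    {"SNO": 99123, "Name": "Mike", "Address": "Texas"},
    {"SNO": 88124, "Name": "Tom", "Address": "California"},
]

for row in rows:
    # Passing id explicitly makes SNO the document _id; indexing the same id
    # again later overwrites (i.e. updates) the existing document.
    es.index(index="test", id=row["SNO"], body=row)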

How to find common struct for all documents in collection?

I have an array of documents that have more or less the same structure, but I need to find the fields that are present in all documents. Something like:
{
"name": "Jow",
"salary": 7000,
"age": 25,
"city": "Mumbai"
},
{
"name": "Mike",
"backname": "Brown",
"sex": "male",
"city": "Minks",
"age": 30
},
{
"name": "Piter",
"hobby": "footbol",
"age": 25,
"location": "USA"
},
{
"name": "Maria",
"age": 22,
"city": "Paris"
},
All docs have name and age. How do I find them with ArangoDB?
You could do the following:
Retrieve the attribute names of each document
Get the intersection of those attributes
i.e.
LET attrs = (FOR item IN test RETURN ATTRIBUTES(item, true))
RETURN APPLY("INTERSECTION", attrs)
APPLY is necessary so each list of attributes in attrs can be passed as a separate parameter to INTERSECTION.
Documentation:
ATTRIBUTES: https://www.arangodb.com/docs/stable/aql/functions-document.html#attributes
INTERSECTION: https://www.arangodb.com/docs/stable/aql/functions-array.html#intersection
APPLY: https://www.arangodb.com/docs/stable/aql/functions-miscellaneous.html#apply
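For comparison, the same two steps (collect each document's attribute names, then intersect them) can be sketched in plain Python using the sample documents from the question:
docs = [
    {"name": "Jow", "salary": 7000, "age": 25, "city": "Mumbai"},
    {"name": "Mike", "backname": "Brown", "sex": "male", "city": "Minks", "age": 30},
    {"name": "Piter", "hobby": "footbol", "age": 25, "location": "USA"},
    {"name": "Maria", "age": 22, "city": "Paris"},
]

# Only keys present in every document survive the intersection.
common = set(docs[0]).intersection(*docs[1:])
print(common)  # {'age', 'name'}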

How to add extra data in a JSON request

In SoapUI I have two steps: a Groovy script step and a REST request step for a POST CRUD method.
In the Groovy script I am creating a random test case property named 'adults'. The value is a random integer between 2 and 5.
testRunner.testCase.setPropertyValue('adults', String.valueOf((int)(Math.random() * 4) + 2));
Below is my rest request for the POST:
{
"xxx": "xxx",
"ratePlanCode": "xxx"
"roomOccupancies": [
{
"passengersInformation": [
{
"firstName": "Test",
"lastName": "Tester",
"isLeadPassenger": true,
"age": 30
},
]
}
],
"xxx": "xxx"
}
Now this request is fixed for 1 adult passenger, but the issue is that if I have multiple passengers, I actually need multiple passengers under "passengersInformation". So virtually for every extra adult I need to add:
{
"firstName": "Test",
"lastName": "Tester",
"isLeadPassenger": false,
"age": 30
},
So what I am thinking is, since we are not allowed duplicate passenger names, we just add a number to the end of the first and last name; the other two fields we can keep the same.
So my question is: how do we add additional passenger details within the request based on the number of adults randomly selected in the Groovy script?
Thank you,
Here's one way to replicate the passenger: Note I had to fix a couple of commas (extra and missing) in the JSON string.
import groovy.json.*
def jsonData = '''{
"hotelArrivalDate": "2017-06-01T18:15:00",
"ratePlanCode": "xxx=",
"roomOccupancies": [
{
"passengersInformation": [
{
"firstName": "Test",
"lastName": "Tester",
"isLeadPassenger": true,
"age": 30
}
]
}
],
"holidaysBookingReference": "TestRef"
}'''
def n = 1
def data = (new JsonSlurper()).parseText(jsonData)
def newPerson = data.roomOccupancies[0].
        passengersInformation[0].
        collectEntries { k, v ->
            ['firstName', 'lastName'].contains(k) ? [k, v + n] : [k, v]
        }
data.roomOccupancies[0].passengersInformation << newPerson
jsonData = (new JsonBuilder(data)).toPrettyString()
Result:
{
"hotelArrivalDate": "2017-06-01T18:15:00",
"ratePlanCode": "xxx=",
"roomOccupancies": [
{
"passengersInformation": [
{
"firstName": "Test",
"lastName": "Tester",
"isLeadPassenger": true,
"age": 30
},
{
"firstName": "Test1",
"lastName": "Tester1",
"isLeadPassenger": true,
"age": 30
}
]
}
],
"holidaysBookingReference": "TestRef"
}
