Filter Out Data in Json - python-3.x

I have the JSON file below, where I need to filter the City data based on the flag value being equal to true.
"attributes": { "State":
[ { "type": "sc", "ov": true, "value": "TN" } ],
"City": [ { "type": "c", "flag": true, "value": "Chennai" },
{ "type": "c", "flag": false, "value": "Coimbatore" } ],
}
I am expecting the output below:
State: TN
City: Chennai

You can write something like the below to keep only the City entries whose flag is True.
import json

# Load the JSON document from disk
with open("/home/imtiaz/tmp/data1.json") as f:
    data = json.load(f)

# Keep only the City entries whose flag is True
data1 = [city for city in data['attributes']['City'] if city['flag'] is True]
data['attributes']['City'] = data1

Just load the JSON file into memory, then use a list comprehension to filter for entries where the flag is true.
import json
with open('yourfile.json', 'r') as citydata:
    cities_dict = json.load(citydata)

true_cities = [city for city in cities_dict['attributes']['City'] if city['flag']]
This won't mutate the original data; it returns a separate list of the cities whose flag is true. To mutate the original data in memory, assign the comprehension's result back to the same key, such as:
cities_dict['attributes']['City'] = [city for city in cities_dict['attributes']['City'] if city['flag']]
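To print the output the question asks for, you can read the state straight out of the document (in the posted JSON the State entry carries "ov" rather than "flag", so it is not filtered) and loop over the filtered cities:

print("State:", cities_dict['attributes']['State'][0]['value'])
for city in true_cities:
    print("City:", city['value'])

This prints "State: TN" and "City: Chennai", matching the expected output.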

Related

Spark DataFrame change datatype based on JSON value

I'm constructing a DataFrame from a JSON file and saving this DataFrame to a Parquet file. The Parquet file is consumed by a Pig script for further processing.
Below is the schema of the JSON file:
{id:"1",
name:"test",
"fields": [
{
"fieldId": "ABC1.0",
"values": [
{
"key": "812320",
"formId": 11100,
"occ": 1,
"attachId": 0
}
]
},
{
"fieldId": "CDE2.0",
"values": [
{
"key": "MA",
"formId": 11100,
"occ": 1,
"attachId": 0
},
{
"key": 23.0,
"formId": 11100,
"occ": 1,
"attachId": 0
}
]
}
]
}
I need to set the data type of the "key" field based on its value. The value of the key could be a string, long, double, or integer.
How can I achieve this using a Spark DataFrame/Dataset?
First of all, the content of your JSON file is invalid. All property names need to be enclosed in double quotes (i.e. id and name in your file should be quoted like the other keys).
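Once the property names are quoted, one way to approach the mixed-type "key" field in PySpark is to read it as a string (Spark widens a field with mixed types to string during schema inference anyway) and classify each value per row. This is only a rough sketch; the file name and the keyType column are illustrative assumptions, not something from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# multiLine is needed because each JSON document spans several lines
df = spark.read.option("multiLine", True).json("fields.json")

# Flatten fields[] and values[] so each row carries a single "key"
flat = (
    df.select(F.explode("fields").alias("field"))
      .select(F.col("field.fieldId").alias("fieldId"),
              F.explode("field.values").alias("v"))
      .select("fieldId", F.col("v.key").cast("string").alias("key"))
)

# Classify each key by the narrowest type its text parses as
typed = flat.withColumn(
    "keyType",
    F.when(F.col("key").rlike(r"^-?\d+$"), "long")
     .when(F.col("key").cast("double").isNotNull(), "double")
     .otherwise("string"),
)
typed.show()

From there you could split the frame by keyType and cast each subset before writing the Parquet output, since a single Parquet column cannot hold more than one type.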

Cosmos Db: How to query for the maximum value of a property in an array of arrays?

I'm not sure how to write queries in Cosmos DB, as I'm used to SQL. My question is about how to get the maximum value of a property in an array of arrays. I've been trying subqueries so far, but apparently I don't understand very well how they work.
In a structure such as the one below, how do I query for the city with the largest population among all states, using the Data Explorer in Azure:
{
    "id": 1,
    "states": [
        {
            "name": "New York",
            "cities": [
                {
                    "name": "New York",
                    "population": 8500000
                },
                {
                    "name": "Hempstead",
                    "population": 750000
                },
                {
                    "name": "Brookhaven",
                    "population": 500000
                }
            ]
        },
        {
            "name": "California",
            "cities": [
                {
                    "name": "Los Angeles",
                    "population": 4000000
                },
                {
                    "name": "San Diego",
                    "population": 1400000
                },
                {
                    "name": "San Jose",
                    "population": 1000000
                }
            ]
        }
    ]
}
This is currently not possible as far as I know: ORDER BY cannot be applied to a value produced by JOINing into nested arrays like this.
It would look a bit like this:
SELECT TOP 1 state.name as stateName, city.name as cityName, city.population FROM c
join state in c.states
join city in state.cities
--order by city.population desc <-- this does not work in this case
You could write a user defined function that will allow you to write the query you probably expect, similar to this: CosmosDB sort results by a value into an array
The result could look like:
SELECT c.name, udf.OnlyMaxPop(c.states) FROM c
function OnlyMaxPop(states){
    function compareStates(stateA, stateB){
        // Sort states descending by their (single) remaining city's population
        return stateB.cities[0].population - stateA.cities[0].population;
    }
    var onlyWithOneCity = states.map(s => {
        var maxpop = Math.max.apply(Math, s.cities.map(o => o.population));
        return {
            name: s.name,
            // Keep only this state's most populous city
            cities: s.cities.filter(x => x.population === maxpop)
        };
    });
    return onlyWithOneCity.sort(compareStates)[0];
}
You would probably need to adapt the function to your exact query needs, but I am not certain what your desired result would look like.

How to extract selected key and value from nested dictionary object in a list?

I have a list example_list that contains two dict objects; it looks like this:
[
    {
        "Meta": {
            "ID": "1234567",
            "XXX": "XXX"
        },
        "bbb": {
            "ccc": {
                "ddd": {
                    "eee": {
                        "fff": {
                            "xxxxxx": "xxxxx"
                        },
                        "www": [
                            {
                                "categories": {
                                    "ppp": [
                                        {
                                            "content": {
                                                "name": "apple",
                                                "price": "0.111"
                                            },
                                            "xxx": "xxx"
                                        }
                                    ]
                                },
                                "date": "A2020-01-01"
                            }
                        ]
                    }
                }
            }
        }
    },
    {
        "Meta": {
            "ID": "78945612",
            "XXX": "XXX"
        },
        "bbb": {
            "ccc": {
                "ddd": {
                    "eee": {
                        "fff": {
                            "xxxxxx": "xxxxx"
                        },
                        "www": [
                            {
                                "categories": {
                                    "ppp": [
                                        {
                                            "content": {
                                                "name": "banana",
                                                "price": "12.599"
                                            },
                                            "xxx": "xxx"
                                        }
                                    ]
                                },
                                "date": "A2020-01-01"
                            }
                        ]
                    }
                }
            }
        }
    }
]
Now I want to filter the items and keep only the "ID" and the corresponding value for "price". The expected result can be something similar to:
[{"ID": "1234567", "price": "0.111"}, {"ID": "78945612", "price": "12.599"}]
or something like {"1234567":"0.111", "78945612":"12.599" }
Here's what I've tried:
map_list = []
map_dict = {}
for item in example_list:
    # get 'ID' for each item in 'meta'
    map_dict['ID'] = item['meta']['ID']
    # get 'price'
    data_list = item['bbb']['ccc']['ddd']['www']
    for data in data_list:
        for dataitem in data['categories']['ppp']:
            map_dict['price'] = item["content"]["price"]
    map_list.append(map_dict)
print(map_list)
The result doesn't look right; it feels like the items aren't being iterated properly. It gives me this result:
[{"ID": "78945612", "price": "12.599"}, {"ID": "78945612", "price": "12.599"}]
It gave me a duplicated result for the second ID, but where is the first ID?
Can someone take a look for me please? Thanks.
Update:
From some comments on another question, I understand that the output keeps getting overwritten because the key names in the dict are always the same, but I'm not sure how to fix this, because the key and value need to be extracted at different levels of the for loops. Any help would be appreciated, thanks.
As @Scott Hunter mentioned, you need to create a new map_dict every time you do this. Here is a quick fix to your solution (sadly, I am not able to test it right now, but it seems right to me):
map_list = []
for item in example_list:
    # note the 'eee' level and the capitalised 'Meta' in the posted data
    data_list = item['bbb']['ccc']['ddd']['eee']['www']
    for data in data_list:
        for dataitem in data['categories']['ppp']:
            # a fresh dict per price, so earlier entries are not overwritten
            map_dict = {}
            map_dict['ID'] = item['Meta']['ID']
            map_dict['price'] = dataitem["content"]["price"]
            map_list.append(map_dict)
print(map_list)
But what you are doing here is basically just forcing your way through. I recommend you take a break and check out some kind of tutorial, which will help you understand how this really works behind the scenes. This is how I would have written it:
list_dicts = []
for example in example_list:
    for www in example['bbb']['ccc']['ddd']['eee']['www']:
        for www_item in www['categories']['ppp']:
            list_dicts.append({
                'ID': example['Meta']['ID'],
                'price': www_item["content"]["price"]
            })
Good luck with this problem and hope it helps :)
You need to create a new dictionary for map_dict for each ID.
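If you prefer the second output shape from the question, {"1234567": "0.111", "78945612": "12.599"}, the same traversal fits in a dict comprehension (a sketch using the paths from the posted data):

price_by_id = {
    example['Meta']['ID']: www_item['content']['price']
    for example in example_list
    for www in example['bbb']['ccc']['ddd']['eee']['www']
    for www_item in www['categories']['ppp']
}
# {'1234567': '0.111', '78945612': '12.599'}

Note that in this shape, if one ID carries several prices, later ones overwrite earlier ones.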

Python dictionary: how to create a (structured) unique dictionary list when a key contains a list of the values of other keys

I have the unstructured dictionary list below, which contains the values of other keys in lists.
I am not sure if the question I'm asking is strange; this is the actual dictionary payload we receive from the source, and it is not aligned into respective entries.
[
    {
        "dsply_nm": [
            "test test",
            "test test",
            "",
            ""
        ],
        "start_dt": [
            "2021-04-21T00:01:00-04:00",
            "2021-04-21T00:01:00-04:00",
            "2021-04-21T00:01:00-04:00",
            "2021-04-21T00:01:00-04:00"
        ],
        "exp_dt": [
            "2022-04-21T00:01:00-04:00",
            "2022-04-21T00:01:00-04:00",
            "2022-04-21T00:01:00-04:00",
            "2022-04-21T00:01:00-04:00"
        ],
        "hrs_pwr": [
            "14",
            "12",
            "13",
            "15"
        ],
        "make_nm": "test",
        "model_nm": "test",
        "my_yr": "1980"
    }
]
"the length of list cannot not be expected and it could be more than 4 sometimes or less in some keys"
#Expected:
I need to check whether the above dictionary is properly structured or not and, based on that, return the proper dictionary list associated with each item.
For example:
# pseudocode
def get_dict_list(items):
    if items_are_not_structured(items):
        result = get_associated_dict_items_mapped(items)
        return result
    else:
        return items
#Final result
expected_dict_list = [
    {"dsply_nm": "test test", "start_dt": "2021-04-21T00:01:00-04:00", "exp_dt": "2022-04-21T00:01:00-04:00", "hrs_pwr": "14"},
    {"dsply_nm": "test test", "start_dt": "2021-04-21T00:01:00-04:00", "exp_dt": "2022-04-21T00:01:00-04:00", "hrs_pwr": "12", "make_nm": "test", "model_nm": "test", "my_yr": "1980"},
    {"dsply_nm": "", "start_dt": "2021-04-21T00:01:00-04:00", "exp_dt": "2022-04-21T00:01:00-04:00", "hrs_pwr": "13"},
    {"dsply_nm": "", "start_dt": "2021-04-21T00:01:00-04:00", "exp_dt": "2022-04-21T00:01:00-04:00", "hrs_pwr": "15"}
]
In the above dictionary payload, the part below is associated with the second dictionary item and has to be mapped accordingly:
"make_nm": "test",
"model_nm": "test",
"my_yr": "1980"
Can anyone help on this?
Thanks
Since customer_details is a list:
dict(zip(customer_details[0], customer_details[0].values()))
this yields:
{'insured_details': ['asset', 'asset', 'asset'],
 'id': ['213', '214', '233'],
 'dept': ['account', 'sales', 'market'],
 'salary': ['12', '13', '14']}
I think a couple of list comprehensions will get you going. If you would like me to unwind them into more traditional for loops, just let me know.
import json

def get_dict_list(item):
    first_value = list(item.values())[0]
    # If the first value is not a list, assume the dict is already structured
    if not isinstance(first_value, list):
        return [item]
    # Otherwise assume every value is a list and zip them up row by row
    return [{key: item[key][i] for key in item.keys()} for i in range(len(first_value))]
customer_details = [
    {
        "insured_details": "asset",
        "id": "xxx",
        "dept": "account",
        "salary": "12"
    },
    {
        "insured_details": ["asset", "asset", "asset"],
        "id": ["213", "214", "233"],
        "dept": ["account", "sales", "market"],
        "salary": ["12", "13", "14"]
    }
]

customer_details_cleaned = []
for detail in customer_details:
    customer_details_cleaned.extend(get_dict_list(detail))
print(json.dumps(customer_details_cleaned, indent=4))
That should give you:
[
    {
        "insured_details": "asset",
        "id": "xxx",
        "dept": "account",
        "salary": "12"
    },
    {
        "insured_details": "asset",
        "id": "213",
        "dept": "account",
        "salary": "12"
    },
    {
        "insured_details": "asset",
        "id": "214",
        "dept": "sales",
        "salary": "13"
    },
    {
        "insured_details": "asset",
        "id": "233",
        "dept": "market",
        "salary": "14"
    }
]
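The payload in the original question also mixes list-valued keys with scalar ones (make_nm, model_nm, my_yr). A variant of get_dict_list that broadcasts the scalars across every generated row, as a sketch (the question's expected output attaches them to only one row, so adjust as needed):

def get_dict_list(item):
    # Lengths of the list-valued entries, if any
    lengths = [len(v) for v in item.values() if isinstance(v, list)]
    if not lengths:
        return [item]
    # Build one dict per index, copying scalar values into every row
    return [
        {key: value[i] if isinstance(value, list) else value
         for key, value in item.items()}
        for i in range(lengths[0])
    ]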

Dynamic Data Flow to separate unique records in Azure Data Factory

I have a requirement to read Parquet files dynamically and extract unique records. Each file can have one or more key columns.
Assuming the files are going to have one key column, I have designed the below data flow with an ID parameter.
In the Aggregate transformation, I am grouping by the ID column and allowing all other columns to pass through.
Note: please observe that the column is being read as ID and not AddressID.
In the next step, in the Select transformation, I am trying to rename this ID to AddressID (using the parameter value). The output shows up like this.
I have tried hard-coding the value (AddressID) in the name field, and that works.
Can someone help me with how to rename this ID to AddressID (the parameter value holding the key column name) dynamically?
Also, the above scenario only covers the case where there is one key column.
Is it possible to use Azure Data Factory to handle a scenario where there is more than one key column and process it dynamically?
Depending on this, we will use ADF or ADB.
Data Flow Code:
{
    "name": "RemoveDuplicateRows",
    "properties": {
        "type": "MappingDataFlow",
        "typeProperties": {
            "sources": [
                {
                    "dataset": {
                        "referenceName": "DS_Parquet_DF",
                        "type": "DatasetReference"
                    },
                    "name": "source1"
                }
            ],
            "sinks": [
                {
                    "dataset": {
                        "referenceName": "DS_Parquet_Cleaned",
                        "type": "DatasetReference"
                    },
                    "name": "sink1"
                }
            ],
            "transformations": [
                {
                    "name": "Aggregate1"
                },
                {
                    "name": "Select1"
                }
            ],
            "script": "parameters{\n\tID as string ('AddressID')\n}\nsource(allowSchemaDrift: true,\n\tvalidateSchema: false,\n\tformat: 'parquet',\n\tpartitionBy('roundRobin', 2)) ~> source1\nsource1 aggregate(groupBy(ID = byName($ID)),\n\teach(match(name!=$ID), $$ = first($$))) ~> Aggregate1\nAggregate1 select(mapColumn(\n\t\teach(match(name=='ID'),\n\t\t\t'AddressID' = $$),\n\t\teach(match(name!='ID'))\n\t),\n\tskipDuplicateMapInputs: true,\n\tskipDuplicateMapOutputs: true) ~> Select1\nSelect1 sink(allowSchemaDrift: true,\n\tvalidateSchema: false,\n\tformat: 'parquet',\n\ttruncate: true,\n\tpartitionBy('roundRobin', 2),\n\tskipDuplicateMapInputs: true,\n\tskipDuplicateMapOutputs: true) ~> sink1"
        }
    }
}
DataFlow Script
parameters{
    ID as string ('AddressID')
}
source(allowSchemaDrift: true,
    validateSchema: false,
    format: 'parquet',
    partitionBy('roundRobin', 2)) ~> source1
source1 aggregate(groupBy(ID = byName($ID)),
    each(match(name!=$ID), $$ = first($$))) ~> Aggregate1
Aggregate1 select(mapColumn(
        each(match(name=='ID'),
            'AddressID' = $$),
        each(match(name!='ID'))
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> Select1
Select1 sink(allowSchemaDrift: true,
    validateSchema: false,
    format: 'parquet',
    truncate: true,
    partitionBy('roundRobin', 2),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> sink1
