nested dictionary in list to dataframe python - python-3.x

I have JSON input from an API:
{
"api_info": {
"status": "healthy"
},
"items": [
{
"timestamp": "time",
"stock_data": [
{
"ticker": "string",
"industry": "string",
"Description": "string"
}
],
"ISIN":xxx,
"update_datetime": "time"
}
]
}
I initially ran:
apiRawData = requests.get(url).json()['items']
Then I ran the json_normalize method:
apiExtractedData = pd.json_normalize(apiRawData,'stock_data',errors='ignore')
Here is the initial output, where stock_data is still contained within a list:
   stock_data                                             ISIN  update_datetime
0  [{'description': 'zzz', 'industry': 'C', 'ticker...]   123   time
What I would like to achieve is a dataframe showing the headers and the corresponding rows:
   description  industry  ticker  ISIN  update_datetime
0  'zzz'        'C'       xxx     123   time
Do direct me if there is already an existing question answered :) cheers.

I think you can simply convert your existing data frame into the expected one using the code below:
apiExtractedData['description'] = apiExtractedData['stock_data'].apply(lambda x: x[0]['description'])
apiExtractedData['industry'] = apiExtractedData['stock_data'].apply(lambda x: x[0]['industry'])
apiExtractedData['ticker'] = apiExtractedData['stock_data'].apply(lambda x: x[0]['ticker'])
And then just delete your stock_data column:
apiExtractedData = apiExtractedData.drop(['stock_data'], axis = 1)
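An alternative that avoids the per-column lambdas is to let json_normalize flatten the nested list itself. The sketch below assumes each item carries a stock_data list plus top-level ISIN and update_datetime keys, as in the question; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample shaped like the 'items' list in the question.
items = [
    {
        "timestamp": "time",
        "stock_data": [
            {"ticker": "xxx", "industry": "C", "description": "zzz"}
        ],
        "ISIN": 123,
        "update_datetime": "time",
    }
]

# record_path names the nested list to expand into rows;
# meta lists the parent-level fields to repeat on each of those rows.
flat = pd.json_normalize(
    items,
    record_path="stock_data",
    meta=["ISIN", "update_datetime"],
)
print(flat)
```

Each entry in stock_data becomes one row, so this also covers items whose list holds more than one record, which the x[0] indexing above would silently drop.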

Related

How to (dynamically) join array with a struct, to get a value from the struct for each element in the array?

I am trying to parse and flatten JSON data that contains an array and a struct.
For every "Id" in the "data_array" column, I need to get the "EstValue" from the "data_struct" column. The column name inside "data_struct" is the actual id (from "data_array"). I tried to use a dynamic join, but I get the error "Column is not iterable". Can't we use dynamic join conditions in PySpark, like we can in SQL? Is there a better way of achieving this?
JSON Input file:
{
"data_array": [
{
"id": 1,
"name": "ABC"
},
{
"id": 2,
"name": "DEF"
}
],
"data_struct": {
"1": {
"estimated": {
"value": 123
},
"completed": {
"value": 1234
}
},
"2": {
"estimated": {
"value": 456
},
"completed": {
"value": 4567
}
}
}
}
Desired output:
Id Name EstValue CompValue
1 ABC 123 1234
2 DEF 456 4567
My PySpark code:
from pyspark.sql.functions import *
rawDF = spark.read.json([f"abfss://{pADLSContainer}@{pADLSGen2}.dfs.core.windows.net/{pADLSDirectory}/InputFile.json"], multiLine = "true")
idDF = rawDF.select(explode("data_array").alias("data_array")) \
.select(col("data_array.id").alias("id"))
idDF.show(n=2,vertical=True,truncate=150)
finalDF = idDF.join(rawDF, (idDF.id == rawDF.select(col("data_struct." + idDF.Id))) )
finalDF.show(n=2,vertical=True,truncate=150)
Error:
def __iter__(self): raise TypeError("Column is not iterable")
Self joins create problems. In this case, you can avoid the join.
You could make arrays from both columns, zip them together and use inline to extract them into columns. The most difficult part is creating an array from the "data_struct" column. Maybe there's a better way, but I could only think of first transforming it into a map type.
Input:
s = """
{
"data_array": [
{
"id": 1,
"name": "ABC"
},
{
"id": 2,
"name": "DEF"
}
],
"data_struct": {
"1": {
"estimated": {
"value": 123
},
"completed": {
"value": 1234
}
},
"2": {
"estimated": {
"value": 456
},
"completed": {
"value": 4567
}
}
}
}
"""
rawDF = spark.read.json(sc.parallelize([s]), multiLine = "true")
Script:
from pyspark.sql import functions as F
id = F.transform('data_array', lambda x: x.id).alias('Id')
name = F.transform('data_array', lambda x: x['name']).alias('Name')
map = F.from_json(F.to_json("data_struct"), 'map<string, struct<estimated:struct<value:long>,completed:struct<value:long>>>')
est_val = F.transform(id, lambda x: map[x].estimated.value).alias('EstValue')
comp_val = F.transform(id, lambda x: map[x].completed.value).alias('CompValue')
df = rawDF.withColumn('y', F.arrays_zip(id, name, est_val, comp_val))
df = df.selectExpr("inline(y)")
df.show()
# +---+----+--------+---------+
# | Id|Name|EstValue|CompValue|
# +---+----+--------+---------+
# | 1| ABC| 123| 1234|
# | 2| DEF| 456| 4567|
# +---+----+--------+---------+

Is this the best way to parse a Json output from Google Ads Stream

Is this the best way to parse JSON output from a Google Ads stream? I am parsing the JSON with pandas and it is taking too much time.
The record count is around 700K.
[{
"results": [
{
"customer": {
"resourceName": "customers/12345678900",
"id": "12345678900",
"descriptiveName": "ABC"
},
"campaign": {
"resourceName": "customers/12345678900/campaigns/12345",
"name": "Search_Google_Generic",
"id": "12345"
},
"adGroup": {
"resourceName": "customers/12345678900/adGroups/789789",
"id": "789789",
"name": "adgroup_details"
},
"metrics": {
"clicks": "500",
"conversions": 200,
"costMicros": "90000000",
"allConversionsValue": 5000.6936,
"impressions": "50000"
},
"segments": {
"device": "DESKTOP",
"date": "2022-10-28"
}
}
],
"fieldMask": "segments.date,customer.id,customer.descriptiveName,campaign.id,campaign.name,adGroup.id,adGroup.name,segments.device,metrics.costMicros,metrics.impressions,metrics.clicks,metrics.conversions,metrics.allConversionsValue",
"requestId": "fdhfgdhfgjf"
}
]
This is the sample JSON. I am saving the stream to a JSON file, then reading it with pandas and trying to dump it to a CSV file.
I want to convert it to CSV format, like this:
import pandas as pd

with open('Adgroups.json', encoding='utf-8') as inputfile:
    df = pd.read_json(inputfile)
df_new = pd.DataFrame(columns= ['Date', 'Account_ID', 'Account', 'Campaign_ID','Campaign',
                                'Ad_Group_ID', 'Ad_Group','Device',
                                'Cost', 'Impressions', 'Clicks', 'Conversions', 'Conv_Value'])
for i in range(len(df['results'])):
    results = df['results'][i]
    for result in results:
        new_row = pd.Series({ 'Date': result['segments']['date'],
                              'Account_ID': result['customer']['id'],
                              'Account': result['customer']['descriptiveName'],
                              'Campaign_ID': result['campaign']['id'],
                              'Campaign': result['campaign']['name'],
                              'Ad_Group_ID': result['adGroup']['id'],
                              'Ad_Group': result['adGroup']['name'],
                              'Device': result['segments']['device'],
                              'Cost': result['metrics']['costMicros'],
                              'Impressions': result['metrics']['impressions'],
                              'Clicks': result['metrics']['clicks'],
                              'Conversions': result['metrics']['conversions'],
                              'Conv_Value': result['metrics']['allConversionsValue']
                              })
        df_new = df_new.append(new_row, ignore_index = True)
df_new.to_csv('Adgroups.csv', encoding='utf-8', index=False)
Don't use df.append. It's very slow because it has to copy the dataframe over and over again. It was deprecated in pandas 1.4 and removed in pandas 2.0.
You can build the rows using a list comprehension before constructing the data frame:
import json

import pandas as pd

with open("Adgroups.json") as fp:
    data = json.load(fp)
columns = [
    "Date",
    "Account_ID",
    "Account",
    "Campaign_ID",
    "Campaign",
    "Ad_Group_ID",
    "Ad_Group",
    "Device",
    "Cost",
    "Impressions",
    "Clicks",
    "Conversions",
    "Conv_Value",
]
# One tuple per result, in the same order as `columns`
records = [
    (
        r["segments"]["date"],
        r["customer"]["id"],
        r["customer"]["descriptiveName"],
        r["campaign"]["id"],
        r["campaign"]["name"],
        r["adGroup"]["id"],
        r["adGroup"]["name"],
        r["segments"]["device"],
        r["metrics"]["costMicros"],
        r["metrics"]["impressions"],
        r["metrics"]["clicks"],
        r["metrics"]["conversions"],
        r["metrics"]["allConversionsValue"],
    )
    for d in data
    for r in d["results"]
]
df = pd.DataFrame(records, columns=columns)
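If the DataFrame is only a stepping stone to the CSV file, the rows could also be written directly with the standard csv module. This is a sketch rather than a measured improvement: the FIELDS table and the write_rows helper are hypothetical names, and the sample data is a trimmed copy of the JSON in the question held in memory instead of read from Adgroups.json.

```python
import csv
import io

# Hypothetical in-memory sample shaped like the stream output in the question.
data = [{"results": [{
    "customer": {"id": "12345678900", "descriptiveName": "ABC"},
    "campaign": {"id": "12345", "name": "Search_Google_Generic"},
    "adGroup": {"id": "789789", "name": "adgroup_details"},
    "metrics": {"clicks": "500", "conversions": 200, "costMicros": "90000000",
                "allConversionsValue": 5000.6936, "impressions": "50000"},
    "segments": {"device": "DESKTOP", "date": "2022-10-28"},
}]}]

# (output column, section in the result, key inside that section)
FIELDS = [
    ("Date", "segments", "date"),
    ("Account_ID", "customer", "id"),
    ("Account", "customer", "descriptiveName"),
    ("Campaign_ID", "campaign", "id"),
    ("Campaign", "campaign", "name"),
    ("Ad_Group_ID", "adGroup", "id"),
    ("Ad_Group", "adGroup", "name"),
    ("Device", "segments", "device"),
    ("Cost", "metrics", "costMicros"),
    ("Impressions", "metrics", "impressions"),
    ("Clicks", "metrics", "clicks"),
    ("Conversions", "metrics", "conversions"),
    ("Conv_Value", "metrics", "allConversionsValue"),
]

def write_rows(batches, out):
    """Write a header plus one CSV row per result; return the row count."""
    writer = csv.writer(out)
    writer.writerow([name for name, _, _ in FIELDS])
    count = 0
    for batch in batches:
        for r in batch["results"]:
            writer.writerow([r[section][key] for _, section, key in FIELDS])
            count += 1
    return count

buffer = io.StringIO()  # stands in for open("Adgroups.csv", "w", newline="")
n_rows = write_rows(data, buffer)
csv_text = buffer.getvalue()
```

Writing row by row keeps memory use flat, at the cost of losing the DataFrame for any later filtering or type conversion.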

Python - How to get all keys where values are zero from list of dictionaries?

I have list of dictionaries that looks like this:
planets = [
{ "name": "Mercury", "moonCount": 0 },
{ "name": "Venus", "moonCount": 0 },
{ "name": "Earth", "moonCount": 1 },
{
"name": "Mars",
"moonCount": 2
},
{
"name": "Jupiter",
"moonCount": 67
},
{
"name": "Saturn",
"moonCount": 62
},
{
"name": "Uranus",
"moonCount": 27
},
{
"name": "Neptune",
"moonCount": 13
},
{
"name": "Pluto",
"moonCount": 4
}
]
I am trying to get the list of all planets which do not have moons.
I can do this very simply if I turn this into a pandas dataframe like this:
import pandas as pd
df = pd.json_normalize(planets)
no_moon_planets = df.loc[df['moonCount']==0, 'name']
no_moon_planets = no_moon_planets.tolist()
no_moon_planets
But I need to do this without creating a pandas dataframe. So I am basically looking for a solution where I can extract the list of planets directly from the list of dictionaries where the number of moons is zero.
Any help is appreciated.
Do the following:
Planets_without_moon = [i['name'] for i in planets if i['moonCount']==0]
You can simply do it with list comprehension:
zero_moon = [dic for dic in planets if dic["moonCount"]==0]
If you just want the name of the planet:
zero_moon = [dic["name"] for dic in planets if dic["moonCount"]==0]
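Both comprehensions assume every dictionary has a "moonCount" key. If some entries might lack it, a small variation with dict.get avoids a KeyError; the "Planet X" entry below is a hypothetical example of such a record, not part of the original data.

```python
planets = [
    {"name": "Mercury", "moonCount": 0},
    {"name": "Earth", "moonCount": 1},
    {"name": "Planet X"},  # hypothetical entry with no moonCount key
]

# .get() returns None for a missing key, so such entries are simply skipped
no_moon_planets = [p["name"] for p in planets if p.get("moonCount") == 0]
print(no_moon_planets)
```

Whether a missing count should be skipped or treated as zero depends on the data; passing a default, as in p.get("moonCount", 0), would give the second behaviour.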

JSON Extract to dataframe using python

I have a JSON file with the structure shown below.
I am trying to get all the details into a dataframe or tabular form. I tried using denormalize but could not get the expected result.
{
"body": [{
"_id": {
"s": 0,
"i": "5ea6c8ee24826b48cc560e1c"
},
"fdfdsfdsf": "V2_1_0",
"dsd": "INDIA-",
"sdsd": "df-as-3e-ds",
"dsd": 123,
"dsds": [{
"dsd": "s_10",
"dsds": [{
"dsdsd": "OFFICIAL",
"dssd": {
"dsds": {
"sdsd": "IND",
"dsads": 0.0
}
},
"sadsad": [{
"fdsd": "ABC",
"dds": {
"dsd": "INR",
"dfdsfd": -1825.717444
},
"dsss": [{
"id": "A:B",
"dsdsd": "A.B"
}
]
}, {
"name": "dssadsa",
"sadds": {
"sdsads": "INR",
"dsadsad": 180.831415
},
"xcs": "L:M",
"sds": "L.M"
}
]
}
]
}
]
}
]
}
This structure is far too nested to put directly into a dataframe. First, you'll need to use the ol' flatten_json function. This function isn't in a library (to my knowledge), but you see it around a lot. Save it somewhere.
def flatten_json(nested_json):
    """
    Flatten json object with nested keys into a single level.
    Args:
        nested_json: A nested json object.
    Returns:
        The flattened json object if successful, None otherwise.
    """
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out
Applying it to your data:
import json
import pandas as pd

with open('deeply_nested.json', 'r') as f:
    flattened_json = flatten_json(json.load(f))
df = pd.json_normalize(flattened_json)
df.columns
Index(['body_0__id_s', 'body_0__id_i', 'body_0_schemaVersion',
'body_0_snapUUID', 'body_0_jobUUID', 'body_0_riskSourceID',
'body_0_scenarioSets_0_scenario',
'body_0_scenarioSets_0_modelSet_0_modelPolicyLabel',
'body_0_scenarioSets_0_modelSet_0_valuation_pv_unit',
'body_0_scenarioSets_0_modelSet_0_valuation_pv_value',
'body_0_scenarioSets_0_modelSet_0_measures_0_name',
'body_0_scenarioSets_0_modelSet_0_measures_0_value_unit',
'body_0_scenarioSets_0_modelSet_0_measures_0_value_value',
'body_0_scenarioSets_0_modelSet_0_measures_0_riskFactors_0_id',
'body_0_scenarioSets_0_modelSet_0_measures_0_riskFactors_0_underlyingRef',
'body_0_scenarioSets_0_modelSet_0_measures_1_name',
'body_0_scenarioSets_0_modelSet_0_measures_1_value_unit',
'body_0_scenarioSets_0_modelSet_0_measures_1_value_value',
'body_0_scenarioSets_0_modelSet_0_measures_1_riskFactors',
'body_0_scenarioSets_0_modelSet_0_measures_1_underlyingRef'],
dtype='object')
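To see how those column names come about, here is a compact, self-contained variant of the same idea run on a tiny input. It is a sketch: the sample dictionary and its "measures" and "name" keys are made up for illustration, and isinstance is swapped in for the type checks.

```python
def flatten_json(nested):
    """Flatten nested dicts and lists into one dict with '_'-joined keys."""
    out = {}

    def flatten(value, prefix=""):
        if isinstance(value, dict):
            for key, child in value.items():
                flatten(child, f"{prefix}{key}_")
        elif isinstance(value, list):
            # List positions become part of the key: measures_0, measures_1, ...
            for index, child in enumerate(value):
                flatten(child, f"{prefix}{index}_")
        else:
            out[prefix[:-1]] = value  # drop the trailing underscore

    flatten(nested)
    return out

# Hypothetical sample, much smaller than the file in the question.
sample = {"body": [{"_id": {"s": 0}, "measures": [{"name": "ABC"}, {"name": "DEF"}]}]}
flat = flatten_json(sample)
print(flat)
```

Because every list index ends up in a key, the result is a single wide row. That suits one document, but a file with many repeated records produces a very large number of columns.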

Creating Nested JSON from Dataframe

I have a dataframe and have to convert it into nested JSON.
countryname name text score
UK ABC Hello 5
Right now, I have some code that generates JSON, grouping by countryname and name together.
However, I want to group first by countryname and then by name. Below is the code and output:
import json

cols = test.columns.difference(['countryname','name'])
j = (test.groupby(['countryname','name'])[cols]
         .apply(lambda x: x.to_dict('records'))
         .reset_index(name='results')
         .to_json(orient='records'))
test_json = json.dumps(json.loads(j), indent=4)
Output:
[
{
"countryname":"UK"
"name":"ABC"
"results":[
{
"text":"Hello"
"score":"5"
}
]
}
]
However, I am expecting an output like this:
[
{
"countryname":"UK"
{
"name":"ABC"
"results":[
{
"text":"Hello"
"score":"5"
}
]
}
}
]
Can anyone please help in fixing this?
This would be valid JSON. Note that the commas between members are required:
[
{
"countryname":"UK",
"name":"ABC",
"results":[
{
"text":"Hello",
"score":"5"
}
]
}
]
The other output you are trying to achieve is also not valid JSON, because the nested object needs a key of its own. A corrected version:
[{
"countryname": "UK",
"you need a name in here": {
"name": "ABC",
"results": [{
"text": "Hello",
"score": "5"
}]
}
}]
I added a placeholder key so you can decide what name to use.
For custom JSON output you will need to use a custom function to reformat your object first.
l = df.to_dict('records')[0]  # first row as a dict
print(l, type(l))  # {'countryname': 'UK', 'name': 'ABC', 'text': 'Hello', 'score': 5} <class 'dict'>
e = l['countryname']
print(e)  # UK
o = [{
    "countryname": l['countryname'],
    "you need a name in here": {
        "name": l['name'],
        "results": [{
            "text": l['text'],
            "score": l['score']
        }]
    }
}]
print(o)  # [{'countryname': 'UK', 'you need a name in here': {'name': 'ABC', 'results': [{'text': 'Hello', 'score': 5}]}}]
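The snippet above reshapes a single row. For a whole dataframe, one possible sketch is to group by countryname first and then by name inside each country. The "names" key is a hypothetical choice for the missing name, and the second row of the sample frame is made up to show two names under one country.

```python
import json

import pandas as pd

# Sample frame: the first row is from the question, the second is hypothetical.
test = pd.DataFrame({
    "countryname": ["UK", "UK"],
    "name": ["ABC", "DEF"],
    "text": ["Hello", "Hi"],
    "score": [5, 3],
})

nested = []
for country, country_df in test.groupby("countryname"):
    names = []
    for name, name_df in country_df.groupby("name"):
        # Round-trip through to_json so values are plain Python types
        results = json.loads(name_df[["text", "score"]].to_json(orient="records"))
        names.append({"name": name, "results": results})
    nested.append({"countryname": country, "names": names})

test_json = json.dumps(nested, indent=4)
print(test_json)
```

Putting the per-name objects in a list under one key keeps the output valid JSON even when a country has several names.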

Resources