Protect a nested object against flattening when using pandas.json_normalize - python-3.x

In pandas >= 1.1.4 / Python 3, I would like to protect a nested element against flattening when using json_normalize().
I cannot find a way to do this in the documentation.
Example
Here's a concrete example to illustrate the main idea:
res = '''
{
    "results": [
        {
            "geometry": {
                "type": "Polygon",
                "crs": 4326,
                "coordinates": [[
                    [6.0, 49.0],
                    [6.0, 40.0],
                    [7.0, 40.0],
                    [7.0, 49.0],
                    [6.0, 49.0]
                ]]
            },
            "attribute": "layer.metadata",
            "bbox": [6, 40, 7, 49],
            "featureName": "Coniferous_Trees",
            "layerName": "State_Forests",
            "type": "Feature",
            "id": "17",
            "properties": {
                "resolution": "100",
                "Year": "2020",
                "label": "Coniferous"
            }
        }
    ]
}
'''
This is a single JSON record from an API response. Here there is only one element in the top-level list, but there may be more, each following the same structure as the one shown here. I'd like to import this into a DataFrame without columns containing structured elements; that is, I want to flatten / normalize them all. Well... almost all. json_normalize() does an amazing job at that:
import json
import pandas as pd

data = json.loads(res)['results']
df = pd.DataFrame(pd.json_normalize(data))
And here are the columns of the DataFrame:
>>> print(df.columns)
Index(['attribute', 'bbox', 'featureName', 'layerName', 'type', 'id',
'geometry.type', 'geometry.crs', 'geometry.coordinates', # <-- the geometry has been flattened
'properties.resolution', 'properties.Year', 'properties.label'],
dtype='object')
Wanted behaviour
But I need to, let's say, "protect" the geometry object in the input JSON response against flattening so that I end up with these columns instead:
# e.g. something like this:
df = pd.DataFrame(pd.json_normalize(data, protect="results.geometry"))
# or this, if no two objects share the same name:
df = pd.DataFrame(pd.json_normalize(data, protect="geometry"))
which would lead to:
>>> print(df.columns)
Index(['attribute', 'bbox', 'featureName', 'layerName', 'type', 'id',
'geometry', 'properties.resolution', # <-- the geometry element has been protected!
'properties.Year', 'properties.label'],
dtype='object')
Is there a way of doing that properly?

Consider max_level=0. Per pandas.json_normalize docs:
max_level : int, default None
Max number of levels(depth of dict) to normalize. if None, normalizes all levels.
data = json.loads(res)["results"]
df = pd.DataFrame(pd.json_normalize(data, max_level=0))
print(df.T)
# 0
# geometry {'type': 'Polygon', 'crs': 4326, 'coordinates'...
# attribute layer.metadata
# bbox [6, 40, 7, 49]
# featureName Coniferous_Trees
# layerName State_Forests
# type Feature
# id 17
# properties {'resolution': '100', 'Year': '2020', 'label':...
print(df.columns)
# Index(['geometry', 'attribute', 'bbox', 'featureName', 'layerName', 'type', 'id',
# 'properties'], dtype='object')
And since none of the nested objects are normalized, use data wrangling to unwind the columns you do need, like properties:
# Expand the 'properties' dicts into their own columns and drop the original
df = (
    df.drop(['properties'], axis="columns")
      .join(df["properties"].dropna().apply(pd.Series))
)
print(df.T)
# 0
# geometry {'type': 'Polygon', 'crs': 4326, 'coordinates'...
# attribute layer.metadata
# bbox [6, 40, 7, 49]
# featureName Coniferous_Trees
# layerName State_Forests
# type Feature
# id 17
# resolution 100
# Year 2020
# label Coniferous
print(df.columns)
# Index(['geometry', 'attribute', 'bbox', 'featureName', 'layerName', 'type', 'id',
# 'resolution', 'Year', 'label'], dtype='object')
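If you need the opposite split (normalize everything except geometry), one workable sketch with the same data is to normalize fully, drop the flattened geometry.* columns, and re-attach the original nested objects:
df = pd.json_normalize(data)

# drop the flattened geometry.* columns and restore the nested dicts
df = df.drop(columns=[c for c in df.columns if c.startswith("geometry.")])
df["geometry"] = [record["geometry"] for record in data]

print(df.columns)
# Index(['attribute', 'bbox', 'featureName', 'layerName', 'type', 'id',
#        'properties.resolution', 'properties.Year', 'properties.label',
#        'geometry'], dtype='object')
This yields the "protected" geometry column the question asks for, just at the end of the frame.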

Related

python3 - check element is actually in list

For example, I have an Excel header list like this:
excel_headers = [
    'Name',
    'Age',
    'Sex',
]
and I have another collection to check against it:
headers = {'Name': 1, 'Age': 2, 'Sex': 3, 'Whatever': 4}
I don't care if headers has extra elements; I only care that every element of excel_headers is present in headers.
WHAT I'VE TRIED
lst = all(headers[idx][0] == header for idx, header in enumerate(excel_headers))
print(lst)
However, it always returns False.
Any help, please?
Another way to do it would be to use set difference:
excel_headers = ['Name', 'Age', 'Sex']
headers = {'Name' : 1, 'Age': 2, 'Sex': 3, 'Whatever': 4}
diff = set(excel_headers) - set(headers)
hasAll = len(diff) == 0 # len 0 means every value in excel_headers is present in headers
print(diff) #this will give you unmatched elements
Just sort your lists; the results show you a before and after:
excel_headers = [
    'Name',
    'Age',
    'Sex',
]
headers = ['Age', 'Name', 'Sex']

if excel_headers == headers: print("YES!")
else: print("NO!")

excel_headers.sort()
headers.sort()

if excel_headers == headers: print("YES!")
else: print("NO!")
Output:
NO!
YES!
Tip: this is a good use case for a set, since you're looking up elements by value to see if they exist. However, for small lists (<100 elements) the difference in performance isn't really noticeable, and using a list is fine.
excel_headers = ['Name', 'Age', 'Sex']
headers = {'Name' : 1, 'Age': 2, 'Sex': 3, 'Whatever': 4}
result = all(element in headers for element in excel_headers)
print(result) # --> True
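For completeness, the same check can be written with explicit sets via issubset (a small sketch reusing the variables above; set(headers) takes the dict's keys):
result = set(excel_headers) <= set(headers)  # <= on sets tests "is a subset of"
print(result)  # --> True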

How to create a list of non-empty dataframe names?

I have created a list of dataframes and want to loop through the list to run some manipulation on some of those dataframes. Note that although I created this list manually, these dataframes exist in my code.
df_list = [df_1, df_2, df_3, df_4, df_5, ...]
list_df_matching = []
list_non_matching = []
Most of these dataframes are blank, but two of them have some records. I want to find the names of those dataframes and collect them in a new list, list_non_matching.
for df_name in df_list:
    q = df_name.count()
    if q > 0:
        list_non_matching.append(df_name)
    else:
        list_df_matching.append(df_name)
My goal is to get a list of dataframe names like [df_4, df_10], but I am getting the following:
[DataFrame[id: string, nbr: string, name: string, code1: string, code2: string],
DataFrame[id: string, nbr: string, name: string, code3: string, code4: string]]
Is the list approach incorrect? Is there a better way of doing it?
Here is an example to illustrate one way to do it, with the help of the empty property and the Python built-in function globals():
import pandas as pd
df1 = pd.DataFrame()
df2 = pd.DataFrame({"col1": [2, 4], "col2": [5, 9]})
df3 = pd.DataFrame(columns = ["col1", "col2"])
df4 = pd.DataFrame({"col1": [3, 8], "col2": [2, 0]})
df5 = pd.DataFrame({"col1": [], "col2": []})
df_list = [df1, df2, df3, df4, df5]
list_non_matching = [
    name
    for df in df_list
    for name in globals()
    if not df.empty and globals()[name] is df
]
print(list_non_matching)
# Output
['df2', 'df4']
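A sturdier alternative, sketched here under the assumption that you can build the mapping yourself where the frames are created, is to keep an explicit dict of names to dataframes instead of searching globals():
df_map = {"df1": df1, "df2": df2, "df3": df3, "df4": df4, "df5": df5}  # hypothetical mapping
list_non_matching = [name for name, df in df_map.items() if not df.empty]
print(list_non_matching)
# ['df2', 'df4']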

Comparing columns with pandas_schema

I am using python 3.8 and Pandas_schema to run integrity checks on data. I have a requirement that workflow_next_step should never be the same as workflow_entry_step. I'm trying to generate a CustomSeriesValidation that compares both columns because I do not see a stock function that does this.
Is there a way to compare two cell values in the same row using Pandas_Schema? In this example, Pandas_Schema would return an error for Mary because she was moved from In Progress to In Progress.
df = config.pd.DataFrame({
    'prospect': ['Bob', 'Jill', 'Steve', 'Mary'],
    'value': [10000, 15000, 500, 50000],
    'workflow_entry_step': ['New', 'In Progress', 'Closed', 'In Progress'],
    'workflow_next_step': ['In Progress', 'Closed', None, 'In Progress']})

schema = Schema([
    Column('prospect', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column('value', [CanConvertValidation(int), 'Doesn\'t convert to integer.']),
    Column('workflow_entry_step', [InListValidation([None, 'New', 'In Progress', 'Closed'])]),
    Column('workflow_next_step', [CustomSeriesValidation(lambda x: x != Column('workflow_entry_step'),
                                                         InListValidation([None, 'New', 'In Progress', 'Closed']))],
           'Steps cannot be the same.')])
import pandas as pd
from pandas_schema import Column, Schema
from pandas_schema.validation import (
    LeadingWhitespaceValidation, TrailingWhitespaceValidation,
    CanConvertValidation, InListValidation, CustomSeriesValidation)

df = pd.DataFrame({
    'prospect': ['Bob', 'Jill', 'Steve', 'Mary'],
    'value': [10000, 15000, 500, 50000],
    'workflow_entry_step': ['New', 'In Progress', 'Closed', 'In Progress'],
    'workflow_next_step': ['In Progress', 'Closed', None, 'In Progress']})

schema = Schema([
    Column('prospect', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column('value', [CanConvertValidation(float)]),
    Column('workflow_entry_step', [InListValidation([None, 'New', 'In Progress', 'Closed'])]),
    Column('workflow_next_step', [
        CustomSeriesValidation(lambda x: x != df['workflow_entry_step'],
                               'Steps cannot be the same.'),
        InListValidation([None, 'New', 'In Progress', 'Closed'])
    ])
])

errors = schema.validate(df)
for error in errors:
    print(error)
Output:
{row: 3, column: "workflow_next_step"}: "In Progress" Steps cannot be the same.
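For context: CustomSeriesValidation passes the entire workflow_next_step column to the lambda, so x != df['workflow_entry_step'] is an element-wise Series comparison; rows where it evaluates to False are reported as errors.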

Creating dictionary from excel data

I have data in Excel and need to create a dictionary from it.
Expected output:
d = [
    {
        "name": "dhdn",
        "usn": "1bm15mca13",
        "sub": ["c", "java", "python"],
        "marks": [90, 95, 98]
    },
    {
        "name": "subbu",
        "usn": "1bm15mca14",
        "sub": ["java", "perl"],
        "marks": [92, 91]
    },
    {
        "name": "paddu",
        "usn": "1bm15mca17",
        "sub": ["c#", "java"],
        "marks": [80, 81]
    }
]
I tried the code below, but it only works for two columns:
import pandas as pd
existing_excel_file = 'BHARTI_Model-4_Migration Service parameters - Input sheet_v1.0_DRAFT_26-02-2020.xls'
df_service = pd.read_excel(existing_excel_file, sheet_name='Sheet2')
df_service = df_service.fillna(method='ffill')
result = [{'name':k,'sub':g["sub"].tolist(),"marks":g["marks"].tolist()} for k,g in df_service.groupby(['name', 'usn'])]
print (result)
I am getting the output below, but I want the expected output shown above.
[{'name': ('dhdn', '1bm15mca13'), 'sub': ['c', 'java', 'python'], 'marks': [90, 95, 98]}, {'name': ('paddu', '1bm15mca17'), 'sub': ['c#', 'java'], 'marks': [80, 81]}, {'name': ('subbu', '1bm15mca14'), 'sub': ['java', 'perl'], 'marks': [92, 91]}]
Finally, I solved it.
import pandas as pd
from pprint import pprint
existing_excel_file = 'BHARTI_Model-4_Migration Service parameters - Input sheet_v1.0_DRAFT_26-02-2020.xls'
df_service = pd.read_excel(existing_excel_file, sheet_name='Sheet2')
df_service = df_service.fillna(method='ffill')
result = [{'name':k[0],'usn':k[1],'sub':v["sub"].tolist(),"marks":v["marks"].tolist()} for k,v in df_service.groupby(['name', 'usn'])]
pprint (result)
It is giving the expected output:
[{'marks': [90, 95, 98],
'name': 'dhdn',
'sub': ['c', 'java', 'python'],
'usn': '1bm15mca13'},
{'marks': [80, 81],
'name': 'paddu',
'sub': ['c#', 'java'],
'usn': '1bm15mca17'},
{'marks': [92, 91],
'name': 'subbu',
'sub': ['java', 'perl'],
'usn': '1bm15mca14'}]
All right! I solved your question although it took me a while.
The first part is the same as your progress.
import pandas as pd
df = pd.read_excel('test.xlsx')
df = df.fillna(method='ffill')
Then we need to get the unique names and how many rows they cover. I'm assuming there are as many unique names as there are unique usn values. I created a list that stores these counts.
unique_names = df.name.unique()
unique_usn = df.usn.unique()

counts = []
for i in range(len(unique_names)):
    counts.append(df.name.str.count(unique_names[i]).sum())

print(counts)
# [3, 2, 2]  # 'dhdn' covers 3 rows, 'subbu' covers 2 rows, etc.
Now we need a smart function that will let us obtain the necessary info from the other columns.
def get_items(column_number):
    empty_list = []
    lower_bound = 0
    for i in range(len(counts)):
        empty_list.append(df.iloc[lower_bound:sum(counts[:i+1]), column_number].values.tolist())
        lower_bound = sum(counts[:i+1])
    return empty_list
I leave it to you to understand what is going on. But basically we are recovering the necessary info. We now just need to apply that to get a list for subs and for marks, respectively.
# column numbers assume the sheet is laid out as name (0), usn (1), sub (2), marks (3)
list_sub = get_items(2)
list_marks = get_items(3)
Finally, we put it all into one list of dicts.
d = []
for i in range(len(unique_names)):
    diction = {}
    diction['name'] = unique_names[i]
    diction['usn'] = unique_usn[i]
    diction['sub'] = list_sub[i]
    diction['marks'] = list_marks[i]
    d.append(diction)
And voilà!
print(d)
[{'name': 'dhdn', 'usn': '1bm15mca13', 'sub': ['c', 'java', 'python'], 'marks': [90, 95, 98]},
 {'name': 'subbu', 'usn': '1bm15mca14', 'sub': ['java', 'perl'], 'marks': [92, 91]},
 {'name': 'paddu', 'usn': '1bm15mca17', 'sub': ['c#', 'java'], 'marks': [80, 81]}]

filter dataframe columns as you iterate through rows and create dictionary

I have the following table of data in a spreadsheet:
Name   Description   Value
foo    foobar        5
baz    foobaz        4
bar    foofoo        8
I'm reading the spreadsheet and passing the data as a dataframe.
I need to transform this table of data to json following a specific schema.
I have the following script:
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        print(row.to_dict())
which returns:
{'Name': 'bar', 'Description': 'foofoo', 'Value': '8'}
I want to be able to filter out a specific column. For example, to return this:
{'Name': 'bar', 'Description': 'foofoo'}
I know that I can print only the columns I want with print(row['Name'], row['Description']); however, this returns only the values, when I also want the keys.
How can I do this?
I wrote this entire thing only to realize that @anky_91 had already suggested it. Oh well...
import pandas as pd
data = {
"name": ["foo", "abc", "baz", "bar"],
"description": ["foobar", "foofoo", "foobaz", "foofoo"],
"value": [5, 3, 4, 8],
}
df = pd.DataFrame(data=data)
print(df, end='\n\n')
rec_dicts = df.loc[df["description"] == "foofoo", ["name", "description"]].to_dict(
"records"
)
print(rec_dicts)
Output:
name description value
0 foo foobar 5
1 abc foofoo 3
2 baz foobaz 4
3 bar foofoo 8
[{'name': 'abc', 'description': 'foofoo'}, {'name': 'bar', 'description': 'foofoo'}]
After converting the row to a dictionary, you can delete the key you don't need with:
del row_dict['Value']
Now the dictionary will have only Name and Description.
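A minimal sketch of that inside the original loop (row_dict is just a name chosen here for the converted row):
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        row_dict = row.to_dict()
        del row_dict['Value']  # drop the unwanted key
        print(row_dict)  # {'Name': 'bar', 'Description': 'foofoo'}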
You can try this:
import io
import pandas as pd
s="""Name,Description,Value
foo,foobar,5
baz,foobaz,4
bar,foofoo,8
"""
df = pd.read_csv(io.StringIO(s))
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        print(row[['Name', 'Description']].to_dict())
Result:
{'Name': 'bar', 'Description': 'foofoo'}
