Convert a list of JSON objects in a Pandas DataFrame column - python-3.x

I have a Dataframe with one column where each cell in the column is a JSON object.
players
0 {"name": "tony", "age": 57}
1 {"name": "peter", age": 46}
I want to convert this to a data frame as:
name age
tony 57
peter 46
Any ideas how I do this?
Note: the original JSON object looks like this...
{
    "players": [
        {"age": 57, "name": "tony"},
        {"age": 46, "name": "peter"}
    ]
}

Use the DataFrame constructor if the values are dicts:
# If the cells were strings rather than dicts, you would parse them first:
#print (type(df.loc[0, 'players']))
#<class 'str'>
#import ast
#df['players'] = df['players'].apply(ast.literal_eval)
print (type(df.loc[0, 'players']))
<class 'dict'>
df = pd.DataFrame(df['players'].values.tolist())
print (df)
age name
0 57 tony
1 46 peter
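A minimal end-to-end sketch of the string case (my reconstruction of the question's data, not code from the original answer):

import ast
import pandas as pd

# Each cell holds a JSON string rather than a dict.
df = pd.DataFrame({'players': ['{"name": "tony", "age": 57}',
                               '{"name": "peter", "age": 46}']})

# Parse the strings into dicts, then expand the dicts into columns.
df['players'] = df['players'].apply(ast.literal_eval)
out = pd.DataFrame(df['players'].values.tolist())
print(out)
#     name  age
# 0   tony   57
# 1  peter   46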
But better is to use json_normalize on the original JSON object, as suggested by @jpp:
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

data = {
    "players": [
        {"age": 57, "name": "tony"},
        {"age": 46, "name": "peter"}
    ]
}
df = json_normalize(data, 'players')
print (df)
age name
0 57 tony
1 46 peter

This can do it:
df = df['players'].apply(pd.Series)
However, it's slow:
In [20]: timeit df.players.apply(pd.Series)
1000 loops, best of 3: 824 us per loop
@jezrael's suggestion is faster:
In [24]: timeit pd.DataFrame(df.players.values.tolist())
1000 loops, best of 3: 387 us per loop

Related

renaming data in a dataframe using config json

I have a situation where I need to change some junk data in JSON, such as:
'a' needs to be 'A'
'b' needs to be 'B'
I want to create a config JSON holding a dictionary whose keys and values look like:
dict = {'a': 'A', 'b': 'B'}
Then, from another Python file, I want to read a dataframe containing those junk values (the keys of the dictionary) and change them to the correct ones (the values of the dictionary). Can anyone help?
So, given the following config.json file:
{
    "junk1": "John",
    "junk2": "Jack",
    "junk3": "Tom",
    "junk4": "Butch"
}
You could have the following python file in the same directory:
import pandas as pd
import json

with open("config.json", "r") as f:
    cfg = json.load(f)

df = pd.DataFrame(
    {
        "class": {0: "class1", 1: "class2", 2: "class3", 3: "class4"},
        "firstname": {0: "junk1", 1: "junk2", 2: "junk3", 3: "junk4"},
    }
)
print(df)
# Outputs
class firstname
0 class1 junk1
1 class2 junk2
2 class3 junk3
3 class4 junk4
And then do:
df["firstname"] = df["firstname"].replace(cfg)
print(df)
# Outputs
class firstname
0 class1 John
1 class2 Jack
2 class3 Tom
3 class4 Butch
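A usage note worth adding here (my addition, not part of the original answer): Series.replace leaves values missing from the mapping untouched, whereas Series.map would turn them into NaN, so replace is the safer choice when the config only covers the junk values:

import pandas as pd

s = pd.Series(["junk1", "keepme"])
mapping = {"junk1": "John"}

print(s.replace(mapping))  # 0 John, 1 keepme -- unmapped values kept
print(s.map(mapping))      # 0 John, 1 NaN    -- unmapped values lost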

Pandas groupby specific range

I am trying to use groupby to group a dataset by a specific range. In the following dataframe, I'd like to group by the max_speed values that are above 150 and count the number of rows at each speed.
An example dataset is as follows:
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 250.0),
        ("bird", "Psittaciformes", 250.0),
        ("mammal", "Carnivora", 180.2),
        ("mammal", "Primates", 159.0),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)
What I've tried is:
df[['max_speed'] > 150].groupby('max_speed').size()
Expected output:
max_speed count
180.2 1
250.0 2
159.0 1
How can I do this?
Thank you
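A minimal sketch of one straightforward approach (my suggestion, assuming the df defined above): filter with a boolean mask first, then group and count. The attempted df[['max_speed'] > 150] fails because it compares the list ['max_speed'] to 150 instead of building a mask on the column.

fast = df[df['max_speed'] > 150]  # boolean mask on the column
counts = fast.groupby('max_speed').size().reset_index(name='count')
print(counts)
#    max_speed  count
# 0      159.0      1
# 1      180.2      1
# 2      250.0      2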

drop duplicated and concat pandas

I have a dataframe that looks like this:
'id': ["1", "2", "1", "3", "3", "4"],
'date': ["2017", "2011", "2019", "2013", "2017", "2018"],
'code': ["CB25", "CD15", "CZ10", None, None, "AZ51"],
'col_example': ["22", None, "22", "55", "55", "121"],
'comments': ["bonjour", "bonjour", "bonjour", "hola", "Hello", None]
Result:
id date code col_example .... comments
0 1 2019 CB25/CZ10 22 .... bonjour (and not bonjour // bonjour)
1 2 2011 CD15 None .... bonjour
2 3 2017 None 55 .... hola // Hello
3 4 2018 AZ51 121 .... None
I want to keep a single row per id.
If two rows share an id, I would like:
If one comment is None and the other is a str: keep only the comment that is not None (example: id = 1, keep the comment "bonjour")
If both comments are str: concatenate the two comments with " // " (example: id = 3, comments = "hola // Hello")
So far I have tried sort_values and drop_duplicates without success.
Thank you
I believe you need GroupBy.agg: take the last date per group and join the non-NaN comments (removed with Series.dropna) by ' // ', then replace empty comment strings with None:
df1 = (df.groupby('id')
         .agg({'date': 'last',
               'comments': lambda x: ' // '.join(x.dropna())})
         .replace({'comments': {'': None}})
         .reset_index())
print (df1)
id date comments
0 1 2019 bonjour
1 2 2011 bonjour
2 3 2017 hola // Hello
3 4 2018 None
EDIT: To avoid losing all the other columns you need to aggregate each of them; you can build the aggregation dictionary dynamically, like:
df = pd.DataFrame({'id': ["1", "2", "1", "3", "3", "4"],
                   'date': ["2017", "2011", "2019", "2013", "2017", "2018"],
                   'code': ["CB25", "CD15", "CB25", None, None, "AZ51"],
                   'col_example': ["22", None, "22", "55", "55", "121"],
                   'comments': [None, "bonjour", "bonjour", "hola", "Hello", None]})
print (df)
id date code col_example comments
0 1 2017 CB25 22 None
1 2 2011 CD15 None bonjour
2 1 2019 CB25 22 bonjour
3 3 2013 None 55 hola
4 3 2017 None 55 Hello
5 4 2018 AZ51 121 None
d = dict.fromkeys(df.columns.difference(['id','comments']), 'last')
d['comments'] = lambda x: ' // '.join(x.dropna())
print (d)
{'code': 'last', 'col_example': 'last', 'date': 'last',
'comments': <function <lambda> at 0x000000000ECA99D8>}
df1 = (df.groupby('id')
         .agg(d)
         .replace({'comments': {'': None}})
         .reset_index())
print (df1)
id code col_example date comments
0 1 CB25 22 2019 bonjour
1 2 CD15 None 2011 bonjour
2 3 None 55 2017 hola // Hello
3 4 AZ51 121 2018 None
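One caveat worth adding (my note, not from the original answer): 'last' simply takes whichever row appears last within each group, so if the input is not already ordered by date, sort first to guarantee the most recent values win:

df = df.sort_values(['id', 'date'])  # make 'last' mean "most recent"
df1 = (df.groupby('id')
         .agg(d)
         .replace({'comments': {'': None}})
         .reset_index())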

filter dataframe columns as you iterate through rows and create dictionary

I have the following table of data in a spreadsheet:
Name Description Value
foo foobar 5
baz foobaz 4
bar foofoo 8
I'm reading the spreadsheet and passing the data as a dataframe.
I need to transform this table of data to json following a specific schema.
I have the following script:
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        print(row.to_dict())
which return:
{'Name': 'bar', 'Description': 'foofoo', 'Value': '8'}
I want to be able to filter out a specific column. For example, to return this:
{'Name': 'bar', 'Description': 'foofoo'}
I know that I can print only the columns I want with print(row['Name'], row['Description']), but that only returns the values, and I also want the keys.
How can I do this?
I wrote this entire thing only to realize that @anky_91 had already suggested it. Oh well...
import pandas as pd

data = {
    "name": ["foo", "abc", "baz", "bar"],
    "description": ["foobar", "foofoo", "foobaz", "foofoo"],
    "value": [5, 3, 4, 8],
}
df = pd.DataFrame(data=data)
print(df, end='\n\n')

rec_dicts = df.loc[df["description"] == "foofoo", ["name", "description"]].to_dict(
    "records"
)
print(rec_dicts)
Output:
name description value
0 foo foobar 5
1 abc foofoo 3
2 baz foobaz 4
3 bar foofoo 8
[{'name': 'abc', 'description': 'foofoo'}, {'name': 'bar', 'description': 'foofoo'}]
After converting the row to a dictionary with to_dict() you can delete the key you don't need, for example:
del d['Value']
Now the dictionary will have only Name and Description.
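A short sketch of that approach, reusing the question's loop (my illustration):

for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        d = row.to_dict()  # convert the row to a plain dict
        del d['Value']     # then drop the unwanted key
        print(d)           # {'Name': 'bar', 'Description': 'foofoo'}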
You can try this:
import io
import pandas as pd
s="""Name,Description,Value
foo,foobar,5
baz,foobaz,4
bar,foofoo,8
"""
df = pd.read_csv(io.StringIO(s))
for index, row in df.iterrows():
    if row['Description'] == 'foofoo':
        print(row[['Name', 'Description']].to_dict())
Result:
{'Name': 'bar', 'Description': 'foofoo'}

Flattening a Pandas JSON Dataframe for a specific path

I have the following JSON
ds = [{
    "name": "groupA",
    "subGroups": [{
        "subGroup": 1,
        "categories": [
            {"category1": {"value": 10}},
            {"category2": {}},
            {"category3": {}}
        ]
    }]
},
{
    "name": "groupB",
    "subGroups": [{
        "subGroup": 1,
        "categories": [
            {"category1": {"value": 500}},
            {"category2": {}},
            {"category3": {}}
        ]
    }]
}]
I can get a dataframe for all the categories by doing:
json_normalize(ds, record_path=["subGroups", "categories"], meta=['name', ['subGroups', 'subGroup']], record_prefix='cat.')
This will give me:
cat.category1 cat.category2 cat.category3 subGroups.subGroup name
0 {'value': 10} NaN NaN 1 groupA
1 NaN {} NaN 1 groupA
2 NaN NaN {} 1 groupA
3 {'value': 500} NaN NaN 1 groupB
4 NaN {} NaN 1 groupB
5 NaN NaN {} 1 groupB
But, I don't care about category2 and category3 at all. I only care about category1.
So I'd prefer something like:
cat.category1 subGroups.subGroup name
0 {'value': 10} 1 groupA
1 {'value': 500} 1 groupB
Any ideas how I get to this?
And even better, I really want the value of value in category1. So something like:
cat.category1.value subGroups.subGroup name
0 10 1 groupA
1 500 1 groupB
Any ideas?
The problem is that category1 is not considered a record by json_normalize. An informal definition of a record is a key in a dictionary that maps to a list of dicts. You can't access category1 (and therefore value) through the record_path argument because it doesn't map to a list of dicts.
This is the best solution I could find:
import numpy as np
import pandas as pd

df = pd.io.json.json_normalize(ds,
                               record_path=['subGroups', 'categories'],
                               errors='ignore',
                               meta=['name',
                                     ['subGroups', 'subGroup']],
                               record_prefix='cat.')
df = df.drop(['cat.category2', 'cat.category3'], axis=1)

# Unpack the 'value' field of each category1 dict, falling back to NaN.
for i in range(df.shape[0]):
    row = df.at[i, 'cat.category1']
    if isinstance(row, dict) and 'value' in row:
        df.at[i, 'cat.category1'] = row['value']
    else:
        df.at[i, 'cat.category1'] = np.nan

# EDIT: if you want to remove rows where cat.category1 is NaN
df = df[pd.notnull(df['cat.category1'])]
Output of df is the desired form of the dataframe.
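As a side note (my alternative, not part of the original answer), the loop can be replaced by a single apply that pulls 'value' out of each dict, mapping non-dicts and missing keys to NaN:

df['cat.category1'] = df['cat.category1'].apply(
    lambda d: d.get('value', np.nan) if isinstance(d, dict) else np.nan
)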
On the other hand, if your JSON structure looked like this (notice the list brackets around the value dict):
ds = [{
    "name": "groupA",
    "subGroups": [{
        "subGroup": 1,
        "categories": [{
            "category1": [{
                "value": 10
            }]
        }]
    }]
},
{
    "name": "groupB",
    "subGroups": [{
        "subGroup": 1,
        "categories": [{
            "category1": [{
                "value": 500
            }]
        }]
    }]
}]
You would be able to use json_normalize like this:
df = pd.io.json.json_normalize(ds,
                               record_path=['subGroups', 'categories', 'category1'],
                               errors='ignore',
                               meta=['name',
                                     ['subGroups', 'subGroup']],
                               record_prefix='cat.')
And you would get this:
   cat.value    name  subGroups.subGroup
0         10  groupA                   1
1        500  groupB                   1
Try using YAML for this purpose: it has yaml.dump to write output in a human-readable format, plus functions to rewrite the output as JSON.
Check the basic video tutorial here:
https://www.youtube.com/watch?v=hSuHnuNC8L4
