I have the following JSON
ds = [{
    "name": "groupA",
    "subGroups": [{
        "subGroup": 1,
        "categories": [{
            "category1": {
                "value": 10
            }
        },
        {
            "category2": {}
        },
        {
            "category3": {}
        }]
    }]
},
{
    "name": "groupB",
    "subGroups": [{
        "subGroup": 1,
        "categories": [{
            "category1": {
                "value": 500
            }
        },
        {
            "category2": {}
        },
        {
            "category3": {}
        }]
    }]
}]
I can get a dataframe for all the categories by doing:
json_normalize(ds, record_path=["subGroups", "categories"], meta=['name', ['subGroups', 'subGroup']], record_prefix='cat.')
This will give me:
cat.category1 cat.category2 cat.category3 subGroups.subGroup name
0 {'value': 10} NaN NaN 1 groupA
1 NaN {} NaN 1 groupA
2 NaN NaN {} 1 groupA
3 {'value': 500} NaN NaN 1 groupB
4 NaN {} NaN 1 groupB
5 NaN NaN {} 1 groupB
But I don't care about category2 and category3 at all; I only care about category1. So I'd prefer something like:
cat.category1 subGroups.subGroup name
0 {'value': 10} 1 groupA
1 {'value': 500} 1 groupB
Any ideas how I get to this?
And even better, I really want the value of "value" inside category1. So something like:
cat.category1.value subGroups.subGroup name
0 10 1 groupA
1 500 1 groupB
Any ideas?
The problem is that category1 is not considered a record by json_normalize. An informal definition of a record is a key in a dictionary that maps to a list of dicts. You can't access category1 (and therefore value) through the record_path argument because it doesn't map to a list of dicts.
This is the best solution I could find:
import pandas as pd
import numpy as np

df = pd.io.json.json_normalize(ds,
                               record_path=['subGroups', 'categories'],
                               errors='ignore',
                               meta=['name',
                                     ['subGroups', 'subGroup'],
                                     ],
                               record_prefix='cat.')

# drop the columns we don't care about
df = df.drop(['cat.category2', 'cat.category3'], axis=1)

# replace each {'value': ...} dict with the bare value (or NaN if missing)
for i in range(df.shape[0]):
    row = df.at[i, 'cat.category1']
    if isinstance(row, dict) and 'value' in row:
        df.at[i, 'cat.category1'] = row['value']
    else:
        df.at[i, 'cat.category1'] = np.nan

# EDIT: if you want to remove rows for which the cat.category1 column has NaN values
df = df[pd.notnull(df['cat.category1'])]
The resulting df is in the desired form.
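As an aside, the explicit loop (and the NaN filter) above could be replaced by a single apply. This is only a sketch of the same idea; run it instead of the loop, not after it:
df['cat.category1'] = df['cat.category1'].apply(
    lambda d: d['value'] if isinstance(d, dict) and 'value' in d else np.nan
)
df = df[pd.notnull(df['cat.category1'])]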
On the other hand, if your JSON structure looked like this (notice the list brackets around the value dict):
ds = [{
    "name": "groupA",
    "subGroups": [{
        "subGroup": 1,
        "categories": [{
            "category1": [{
                "value": 10
            }]
        }]
    }]
},
{
    "name": "groupB",
    "subGroups": [{
        "subGroup": 1,
        "categories": [{
            "category1": [{
                "value": 500
            }]
        }]
    }]
}]
You would be able to use json_normalize like this:
df = pd.io.json.json_normalize(ds,
                               record_path=['subGroups', 'categories', 'category1'],
                               errors='ignore',
                               meta=['name',
                                     ['subGroups', 'subGroup'],
                                     ],
                               record_prefix='cat.')
And you would get this:
cat.value name subGroups.subGroup
10 groupA 1
500 groupB 1
Try using YAML for this purpose: it has yaml.dump to write output in a human-readable format and other functions to rewrite the output as JSON.
Check the basic video tutorial here:
https://www.youtube.com/watch?v=hSuHnuNC8L4
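For what that suggestion is worth, here is a minimal sketch of the YAML round-trip, assuming PyYAML is installed; it just dumps the ds structure from above and rewrites it as JSON:
import json
import yaml

yaml_text = yaml.safe_dump(ds, sort_keys=False)  # human-readable YAML dump
print(yaml_text)

parsed = yaml.safe_load(yaml_text)               # parse the YAML back into Python objects
print(json.dumps(parsed, indent=2))              # rewrite the same data as JSON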
Related
I have a JSON file with a list of dicts. I want to modify its content by adding a key:value pair to every dict, with the index as the value.
Note that the JSON file is malformed, so I need to remove the extra '[]'.
file.json
[
    {
        "sample1": 1,
        "sample2": "value"
    }
    []
    {
        "sampleb": "123",
        "some": "some"
    }
    ...............
]
code
""" open the files"""
with open("list1.json", "r") as f:
data = f.read()
data = data.replace("][", ",")
data = json.loads(data)
for v in data:
for i, c in v.items():
c["rank"] = i + 1
""" put back to file"""
with open("list1.json", "w") as file:
file.write(data)
So what I am trying to achieve is something like
[
    {
        "rank": 1,
        "sample1": 1,
        "sample2": "value"
    },
    {
        "rank": 2,
        "sampleb": "123",
        "some": "some"
    }
    ...............
]
But I got this error:
c["rank"] = i
TypeError: 'str' object does not support item assignment
Printing the index (print(i)) shows:
0,
1,
.....
0,
1,
........
0,
1
But it should be
0,
1,
2,
3,
4
5
...
100
Any ideas?
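One possible fix, only as a sketch and assuming the replace("][", ",") trick really does produce valid JSON for the full file: enumerate over the list of dicts instead of over each dict's items, and use json.dump when writing back.
import json

with open("list1.json", "r") as f:
    text = f.read()

text = text.replace("][", ",")   # keep the original workaround for the malformed JSON
data = json.loads(text)

# enumerate gives one running index over the whole list; v.items() iterates
# over each dict's (key, value) pairs, which is why c was a string before.
for rank, record in enumerate(data, start=1):
    record["rank"] = rank

with open("list1.json", "w") as f:
    json.dump(data, f, indent=4)  # json.dump serialises the list; file.write(data) would fail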
I have been struggling with this for hours and I feel like crying now as I'm unable to fathom out what is happening.
Here is a simplified version of my data:
{
    "first_name": {
        "0": "OBELISK",
        "1": "RA"
    },
    "golongan": {
        "0": 88,
        "1": 99
    },
    "last_name": {
        "0": "GOD",
        "1": "GOD"
    },
    "nik": {
        "0": 666679,
        "1": 666678
    },
    "status_aktif": {
        "0": 1,
        "1": 1
    },
    "tgl_kerja": {
        "0": "Sat, 20 Nov 2021 16:28:00 GMT",
        "1": "Thu, 25 Nov 2021 16:28:00 GMT"
    }
}
This is the code I have:
@app.route('/karyawan-import-excel')
def import_excel():
    # df = pd.read_csv(r'C:\xampp\htdocs\python\coba_read_write_excel\testing.csv')
    df = pd.read_excel(r'C:\xampp\htdocs\python\coba_read_write_excel\testing.xlsx')
    data = df.to_dict()

    cur = mysql.connection.cursor(curMysql)
    sql = "INSERT INTO zzz_customers (name, address) VALUES (%s, %s)"
    val = [
        (data.get('nik'), data.get('first_name'))
        # ('Peter', 'Lowstreet 4'),
        # ('Amy', 'Apple st 652')
    ]
    cur.executemany(sql, val)
    mysql.connection.commit()
The data structure you have created by doing data = df.to_dict() is a nested dictionary. There's not a straightforward way that I know of to get a data structure like that into a MySQL table.
Instead you can make this small change to that line, as shown below, which will give you a list of dicts -- an easier data structure to work with which you can then insert into a MySQL table.
data = df.to_dict('records') # produces list of dicts
Then you can "get" -- as you were trying to do -- just the values you want into a list of tuples:
list_of_tuples = [
(d['nik'], d['first_name']) for d in data
]
cur.executemany(sql, list_of_tuples)
# ...
This is what list_of_tuples should look like:
In [11]: list_of_tuples
Out[11]: [(666679, 'OBELISK'), (666678, 'RA')]
I have a nested dictionary which comprises multiple lists and dictionaries. The "stations" key contains the values which I want to convert to a CSV file. I am only after certain values. A snippet of the dictionary is below:
data = { "brands": {...},
"fueltypes": {...},
"stations": {"items": [
{
"brandid": "",
"stationid": "",
"brand": "Shell",
"code": "2126",
"name": "Cumnock General Store",
"address": "31 Obley St, CUMNOCK NSW 2867",
"location": {
"latitude": -32.928744,
"longitude": 148.755153
},
"state": "NSW"
},
{
"brandid": "",
"stationid": "",
"brand": "Shell",
"code": "2200",
"name": "Tea Tree Cafe",
"address": "160 Mount Darragh Rd, SOUTH PAMBULA NSW 2549",
"location": {
"latitude": -36.944277,
"longitude": 149.845399
},
"state": "NSW"
}....]}}
In order to obtain certain values from the "stations" key, I created blank lists for each of those values and appended to them accordingly. After that I used the zip function to combine the lists and wrote the result to a CSV. The code I used is below:
import csv

Station_Code = []
Station_Name = []
Latitude = []
Longitude = []
Address = []
Brand = []

for k, v in data["stations"].items():
    for item in range(len(v)):
        Station_Code.append(v[item]["code"])
        Station_Name.append(v[item]["name"])
        Latitude.append(v[item]["location"]["latitude"])
        Longitude.append(v[item]["location"]["longitude"])
        Address.append(v[item]["address"])
        Brand.append(v[item]["brand"])
        #print(f'{v[item]["code"]} - {v[item]["name"]} - {v[item]["location"]["latitude"]}')

rows = zip(Station_Code, Station_Name, Latitude, Longitude, Address, Brand)

with open("Exported_File.csv", "w") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
Are there any alternative, shorter ways of extracting this information?
If you're using pandas, there's a fairly easy way to do this.
import pandas as pd
# Convert dict to a pandas DataFrame
df = pd.DataFrame(data["stations"]["items"])
# 'location' is a dict, so we need to extract the 'latitude' and 'longitude'.
df['latitude'] = df['location'].apply(lambda x: x['latitude'])
df['longitude'] = df['location'].apply(lambda x: x['longitude'])
# Select subset of columns for final csv
df = df[['code', 'name', 'latitude', 'longitude', 'address', 'brand']]
df.to_csv('exported-file.csv', index=False, header=False)
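If you are on a newer pandas, an even shorter variant is possible; this is just a sketch, and the dotted column names below are an assumption based on json_normalize's default flattening of nested dicts:
import pandas as pd

df = pd.json_normalize(data["stations"]["items"])
cols = ['code', 'name', 'location.latitude', 'location.longitude', 'address', 'brand']
df[cols].to_csv('exported-file.csv', index=False, header=False)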
I have a DataFrame with one column where each cell is a JSON object.
players
0 {"name": "tony", "age": 57}
1 {"name": "peter", age": 46}
I want to convert this to a data frame as:
name age
tony 57
peter 46
Any ideas how I do this?
Note: the original JSON object looks like this...
{
    "players": [{
        "age": 57,
        "name": "tony"
    },
    {
        "age": 46,
        "name": "peter"
    }]
}
Use the DataFrame constructor if the values are dicts:
# If the column actually holds strings rather than dicts, convert them first:
# import ast
# df['players'] = df['players'].apply(ast.literal_eval)

print(type(df.loc[0, 'players']))
# <class 'dict'>

df = pd.DataFrame(df['players'].values.tolist())
print(df)

   age   name
0   57   tony
1   46  peter
But it is better to use json_normalize on the original JSON object, as suggested by @jpp:
json = {
    "players": [{
        "age": 57,
        "name": "tony"
    },
    {
        "age": 46,
        "name": "peter"
    }]
}

df = json_normalize(json, 'players')
print(df)
age name
0 57 tony
1 46 peter
This can do it:
df = df['players'].apply(pd.Series)
However, it's slow:
In [20]: timeit df.players.apply(pd.Series)
1000 loops, best of 3: 824 us per loop
@jezrael's suggestion is faster:
In [24]: timeit pd.DataFrame(df.players.values.tolist())
1000 loops, best of 3: 387 us per loop
I have this [3526 rows x 5 columns] DF, where col0 is time, col1-col3 are tags and col4 is my value.
0 1 2 3 4
0 2017-09-29 22:41:51 10.2.95.5 C1195_LF470_SARF 0.0.1.1 11993
1 2017-09-29 22:41:37 10.2.52.7 CF643_RCZ70_SARM 0.0.1.16 12102
2 2017-09-29 22:41:39 10.2.102.7 C1345_BQS70_SARF 0.0.1.17 18173
3 2017-09-29 22:41:41 10.2.23.212 CN165_FS470_SAR8 0.0.0.7 23525
4 2017-09-29 22:41:38 10.2.96.4 CF832_UY570_SARM 0.0.1.4 6162
So, I want to write that DF into influxdb. I'll do ...
timeValues = df[ ['col0','col4'] ]
tags = { 'col1': df[['col1']], 'col2': df[['col2']], 'col3':df[['col3']] }
dbConnDF = DataFrameClient(dbAddress, dbPort, dbUser, dbPassword, dbName)
dbConnDF.write_points(dbName, tbName, timeValues, tags = tags)
After that, I get the error
Must be DataFrame with Datetime or PeriodIndex
However, if I insert row by row using this...
dbConnQRY = InfluxDBClient(dbAddress, dbPort, dbUser, dbPassword, dbName)
dbConnQRY.write_points(bodyDB)
where:
bodyDB = [{
    "measurement": tbName,
    "tags": {
        "col1": col1,
        "col2": col2,
        "col3": col3
    },
    "time": col0,
    "fields": {
        "col4": col4
    }
}]
... I get no error at all. So the problem appears when I try to insert the whole DF at once.
How do I tell influxdb that col0 is my index to avoid the error?
Thanks!
Create a datetime index for the dataframe:
timeValues = df[['col4']]
timeValues.index = pd.to_datetime(df['col0'])  # write_points needs a DatetimeIndex, not a plain column
followed by
dbConnDF = DataFrameClient(dbAddress, dbPort, dbUser, dbPassword, dbName)
dbConnDF.write_points(dbName, tbName, timeValues, tags=tags)
That should solve the indexing problem.
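For completeness, a minimal sketch assuming influxdb-python's DataFrameClient, whose write_points expects the dataframe first and the measurement name second; tag_columns can mark col1-col3 as tags instead of building a separate tags dict:
import pandas as pd
from influxdb import DataFrameClient

frame = df[['col1', 'col2', 'col3', 'col4']].copy()
frame.index = pd.to_datetime(df['col0'])   # DatetimeIndex, as the error message demands

client = DataFrameClient(dbAddress, dbPort, dbUser, dbPassword, dbName)
client.write_points(frame, tbName,
                    tag_columns=['col1', 'col2', 'col3'],
                    field_columns=['col4'])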