How to read multiple levels of JSON file through Python pandas? - python-3.x

I am trying to read a JSON file where the data sit at various levels, i.e. top --> inner --> innermost.
I have tried pd.json_normalize, but I don't think it is working. I have attached a screenshot. In it, the topmost level is "WTFY_Combined", and inside it there are three more levels of data. Out of those levels, I need to read "OccuMa" (marked in yellow), and inside "OccuMa" there is another level of data with "OccuCode" and "OccuDesc". I need to read those two into two different dataframes.
I know one way would be to split those two out into two separate JSON files, but in reality I will have to read such multi-level structures directly.
I am trying the below code:
import pandas as pd
import json as js

with open("filepath", "r") as f:
    data = js.loads(f.read())

df_flat = pd.json_normalize(data, record_path=['OccuCode'])
df_flat2 = pd.json_normalize(data, record_path=['OccuDesc'])
But it is not working; it raises a KeyError, for the obvious reason that I am not mapping the nested data into the dataframes properly.

You can drill down through the nesting first, then build the dataframe from the innermost list:
import pandas as pd
import json as js

with open("filepath", "r") as f:
    json_data = js.loads(f.read())

# walk down the levels to the list of records
json_data = json_data['WTFY_Combined']['Extraction']['abc']['OccuMa']
df_data = pd.DataFrame(json_data)[['OccuCode', 'OccuDesc']]

Related

siphon error - 400: NETCDF4 format not supported for ANY_POINT feature type

I'm trying to get a dataset from a TDS catalog with siphon, but with multiple variables it shows me that error on the last line. Here is the code:
import datetime

from siphon.catalog import TDSCatalog
from xarray.backends import NetCDF4DataStore
import xarray as xr

point = [-7, 41]
hours = 48

best_gfs = TDSCatalog('http://thredds.ucar.edu/thredds/catalog/grib/NCEP/GFS/'
                      'Global_0p25deg/catalog.xml?dataset=grib/NCEP/GFS/Global_0p25deg/Best')
best_ds = list(best_gfs.datasets.values())[0]
ncss = best_ds.subset()

query = ncss.query()
query.lonlat_point(point[1], point[0]).time_range(
    datetime.datetime.utcnow(),
    datetime.datetime.utcnow() + datetime.timedelta(hours=hours))
query.accept('netcdf4')
query.variables('Temperature_surface',
                'Relative_humidity_height_above_ground',
                'u-component_of_wind_height_above_ground',
                'v-component_of_wind_height_above_ground',
                'Wind_speed_gust_surface')
data = ncss.get_data(query)
Thanks!
That message is because your point request is trying to return a mix of time series (your _surface variables) and time series of profiles (the u/v wind components). The combination of different features in a single netCDF file is unsupported by the netCDF CF-Conventions.
One work-around is to request CSV or XML formatted data instead (which siphon can still parse and return as a dictionary-of-arrays).
The other is to make separate requests for fields with different geometry. So one for Temperature_surface and Wind_speed_gust_surface, one for u-component_of_wind_height_above_ground and v-component_of_wind_height_above_ground, and one final one for Relative_humidity_height_above_ground. This last split is working around an apparent bug in the THREDDS Data Server where profiles with different vertical levels can't be combined either.
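Below is a minimal sketch of the split-request work-around, reusing the ncss, point, and hours objects from the question; the grouping of variables into requests is an assumption based on the geometries described above:

import datetime

def fetch_point(ncss, point, hours, *variables):
    # one NCSS point query per group of variables sharing a feature geometry
    query = ncss.query()
    query.lonlat_point(point[1], point[0]).time_range(
        datetime.datetime.utcnow(),
        datetime.datetime.utcnow() + datetime.timedelta(hours=hours))
    query.accept('netcdf4')  # or 'csv' / 'xml' for the first work-around
    query.variables(*variables)
    return ncss.get_data(query)

# time series at the surface
surface = fetch_point(ncss, point, hours,
                      'Temperature_surface', 'Wind_speed_gust_surface')
# profile variables, requested separately per the TDS bug noted above
winds = fetch_point(ncss, point, hours,
                    'u-component_of_wind_height_above_ground',
                    'v-component_of_wind_height_above_ground')
rh = fetch_point(ncss, point, hours, 'Relative_humidity_height_above_ground')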

Accessing columns in a csv using python

I am trying to access data from a CSV using Python. I am able to access entire columns of data values; however, I want to also access rows, and use something like an indexed coordinate system, where (0, 1) is column 0, row 1. So far I have this:
#Lukas Robin
#25.07.2021
import csv

with open("sun_data.csv") as sun_data:
    sunData = csv.reader(sun_data, delimiter=',')
    global data
    for data in sunData:
        print(data)
I don't normally use data tables or CSV, so this is a new area for me.
As mentioned in the comment, you could make the jump to using pandas and spend a little time learning it. That would be a good investment of time if you plan to do much data analysis or work with data tables regularly.
If you just want to pull in a table of numbers and access it as you describe, you are perfectly fine using the csv package. Below is an example.
If your .csv file has a header, you can simply add next(sun_data) before starting the inner loop to advance the iterator and let that header row fall on the floor.
import csv

f_in = 'data_table.csv'
data = []  # a container to hold the results

with open(f_in, 'r') as source:
    sun_data = csv.reader(source, delimiter=',')
    # next(sun_data)  # uncomment to skip a header row, as noted above
    for row in sun_data:
        # convert the read-in values to float data types (or ints or ...)
        row = [float(t) for t in row]
        # append it to the data table
        data.append(row)

print(data[1][0])
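With the table loaded this way, data[row][col] addresses any cell, and a full column is just a comprehension. If you later take the pandas route suggested above, a minimal equivalent sketch (assuming the same headerless numeric file) would be:

import pandas as pd

df = pd.read_csv('data_table.csv', header=None)
print(df.iat[1, 0])        # row 1, column 0 -- the same cell as data[1][0]
col0 = df[0].tolist()      # every value in column 0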

Stuck using pandas to build RPG item generator

I am trying to build a simple random item generator for a game I am working on.
So far I am stuck trying to figure out how to store and access all of the data. I went with pandas using .csv files to store the data sets.
I want to add weighted probabilities to what items are generated so I tried to read the csv files and compile each list into a new set.
I got the program to pick a random set but got stuck when trying to pull a random row from that set.
I am getting an error when I use .sample() to pull the item row, which makes me think I don't understand how pandas works. I think I need to be creating new lists so I can later index and access the various statistics of the items once one is selected.
Once I pull the item, I was intending to add effects that would change the displayed damage, armor, and such. So I was thinking of having the new item be its own list, then using damage = item[2] + 3 or whatever I need.
The error is: AttributeError: 'list' object has no attribute 'sample'
Can anyone help with this problem? Maybe there is a better way to set up the data?
Here is my code so far:
import pandas as pd
import random

df = [pd.read_csv('weapons.csv'), pd.read_csv('armor.csv'), pd.read_csv('aether_infused.csv')]

def get_item():
    # this part seemed to work: when I printed item_class it printed
    # one of the entire lists at the correct odds
    item_class = [random.choices(df, weights=(45, 40, 15), k=1)]
    item = item_class.sample()
    print(item)  # to see if the program is working

get_item()
I think you are getting slightly confused with lists vs. list elements. This should work; I stubbed your dataframes with simple ones.
import pandas as pd
import random

# Actual data. Comment it out if you do not have the csv files.
df = [pd.read_csv('weapons.csv'), pd.read_csv('armor.csv'), pd.read_csv('aether_infused.csv')]
# My stubs -- uncomment and use this instead of the line above if you
# want to run this specific example:
# df = [pd.DataFrame({'weapons': ['w1', 'w2']}),
#       pd.DataFrame({'armor': ['a1', 'a2', 'a3']}),
#       pd.DataFrame({'aether': ['e1', 'e2', 'e3', 'e4']})]

def get_item():
    # I removed [] from the line below -- choices() already returns a list of length 1
    item_class = random.choices(df, weights=(45, 40, 15), k=1)
    # I added [0] to choose the first element of item_class, which is a
    # list of length 1 from the line above
    item = item_class[0].sample()
    print(item)  # to see if the program is working

get_item()
This prints random rows from the random dataframes that I set up, such as:
  weapons
1      w2
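To then read individual stats off the sampled row, as the question intends with damage = item[2] + 3, one option is to turn the one-row dataframe into a Series. A self-contained sketch (the 'damage' column is hypothetical; use whatever headers your csv files actually have):

import pandas as pd

# hypothetical item table -- the real ones come from read_csv as above
weapons = pd.DataFrame({'name': ['sword', 'axe'], 'damage': [7, 9]})

item = weapons.sample()       # one random row, as a 1-row dataframe
stats = item.iloc[0]          # the row as a Series indexed by column name
damage = stats['damage'] + 3  # e.g. apply an effect bonus
print(stats['name'], damage)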

Combining multiple data rows on key lookup

So I am trying to combine multiple CSV files. I have one csv with a current part-number list of products we stock. Sorry, I can't embed images as I am new. I've seen many similar posts, but none with both a merge and a groupby together.
current_products
I have another csv with a list of image files that are associated with each part but are split across multiple rows. This list also has many more parts than we offer, so merging on the current_products sku is important.
product_images
I would like to reference the first csv for the parts I currently use and combine the image files in the following format.
newestproducts
I get an AttributeError: 'function' object has no attribute 'to_csv', although when I just print the output in the terminal it appears to be the way I want it.
import pandas as pd

current_products = 'currentproducts.csv'
product_images = 'productimages.csv'
image_list = 'newestproducts.csv'

df_current_products = pd.read_csv(current_products)
df_product_images = pd.read_csv(product_images)

df_current_products['sku'] = df_current_products['sku'].astype(str)
df_product_images['sku'] = df_product_images['sku'].astype(str)

df_merged = pd.merge(df_current_products, df_product_images[['sku', 'images']], on='sku', how='left')
df_output = df_merged.groupby(['sku'])['images_y'].apply('&&'.join).reset_index
#print(df_output)
df_output.to_csv(image_list, index=False)
You are missing () after reset_index:
df_output = df_merged.groupby(['sku'])['images_y'].apply('&&'.join).reset_index()
Without the parentheses, df_output ends up being the bound method itself rather than a dataframe (just print type(df_output) to see that), so naturally it doesn't have any method named to_csv.
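For illustration, here is a minimal self-contained sketch of the corrected pipeline on stub data (the column names mirror the question; your real files may differ, e.g. whether the merged column ends up as images or images_y):

import pandas as pd

# stub frames standing in for currentproducts.csv / productimages.csv
df_current_products = pd.DataFrame({'sku': ['A1', 'B2']})
df_product_images = pd.DataFrame({'sku': ['A1', 'A1', 'B2'],
                                  'images': ['a_front.jpg', 'a_back.jpg', 'b.jpg']})

df_merged = pd.merge(df_current_products, df_product_images[['sku', 'images']],
                     on='sku', how='left')
# note the () on reset_index -- it returns a dataframe, not a method
df_output = df_merged.groupby(['sku'])['images'].apply('&&'.join).reset_index()
print(df_output)
#   sku                   images
# 0  A1  a_front.jpg&&a_back.jpg
# 1  B2                    b.jpg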

Writing multiple columns into csv files using python

I am a Python beginner. I am trying to write multiple lists into separate columns in a csv file.
In my csv file, I would like to have:
2.9732676520000001 0.0015852047556142669 1854.1560636319559
4.0732676520000002 0.61902245706737125 2540.1258143280334
4.4032676520000003 1.0 2745.9167395368572
Following is the code that I wrote.
df = [2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS = [1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak = [0.0015852047556142669, 0.61902245706737125, 1.0]

with open('output/' + file, 'w') as f:
    for dt, int_norm, CSs in zip(df, int_peak, CS):
        f.write('{0:f},{1:f},{2:f}\n'.format(dt, int_norm, CSs))
This isn't running properly; I'm getting the error message non-empty format string passed to object.__format__. I'm having a hard time catching what is going wrong. Could anyone spot what's wrong with my code?
You are better off using pandas:
import pandas as pd

df = [2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS = [1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak = [0.0015852047556142669, 0.61902245706737125, 1.0]

file_name = "your_file_name.csv"

# pandas can convert a list of lists to a dataframe;
# each list is a row, so after constructing the dataframe a
# transpose is applied to get to the desired output
df = pd.DataFrame([df, int_peak, CS])
df = df.transpose()

# write the data to the specified output path ("output/" + file_name)
# without adding the index of the dataframe to the output
# and without adding a header to the output
# => these parameters are set to fit the desired output
df.to_csv("output/" + file_name, index=False, header=None)
The output looks like this:
2.973268 0.001585 1854.156064
4.073268 0.619022 2540.125814
4.403268 1.000000 2745.916740
However, to fix your original code, you need to use another file-name variable instead of file. I changed that in your code as follows:
df = [2.9732676520000001, 4.0732676520000002, 4.4032676520000003]
CS = [1854.1560636319559, 2540.1258143280334, 2745.9167395368572]
int_peak = [0.0015852047556142669, 0.61902245706737125, 1.0]

file_name = "your_file_name.csv"

with open('/tmp/' + file_name, 'w') as f:
    for dt, int_norm, CSs in zip(df, int_peak, CS):
        f.write('{0:f},{1:f},{2:f}\n'.format(dt, int_norm, CSs))
and it works. The output is as follows:
2.973268,0.001585,1854.156064
4.073268,0.619022,2540.125814
4.403268,1.000000,2745.916740
If you need to write only a few selected columns to CSV, then you should use the following option:
csv_data = df.to_csv(columns=['Name', 'ID'])
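Note that with the transposed dataframe built above, the column labels are the default integers 0, 1, 2 rather than names, so the equivalent call there would be something like:

# keep only the first two columns (labels 0 and 1) of the transposed frame
csv_data = df.to_csv(columns=[0, 1])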
