Formatting JSON data to convert to xlsx - python-3.x

I am downloading data in json format in Python 3.7 and trying to display it as an Excel spreadsheet. I tried using pandas, but I think that there is a problem in getting the data into a dataframe. I also tried a csv approach but that was not successful.
I am new to learning python, so maybe there is something obvious I am missing.
import json, requests, urllib.request, csv
import pandas as pd
url = urllib.request.urlopen('https://library.brown.edu/search/solr_pub/iip/?start=0&rows=100&indent=on&wt=json&q=*')
str_response = url.read().decode('utf-8')
jsondata=json.loads(str_response)
df = pd.DataFrame(jsondata)
I was hoping to get a row for each item, e.g., zoor0353, with columns for each of the keys associated with it (e.g., region, date_desc, etc. - there are quite a few). Instead, it seemed to pick up only the top-level structure, returning:
          responseHeader                                      response
QTime     1                                                   NaN
docs      NaN                                                 [{'inscription_id': 'zoor0353', 'metadata': ['...
numFound  NaN                                                 4356
params    {'q': '*', 'indent': 'on', 'start': '0', 'rows...  NaN
start     NaN                                                 0
status    0                                                   NaN
I also tried the normalization method, but did no better. Ultimately, I would like to build a dataset by appending the results of many calls to this API, and I am also wondering whether (and how) I will need to manipulate the data to get it to work with pandas.

Since I'm not sure exactly what you need to get, I'll show you:
how to use pandas.io.json.json_normalize
how to extract specific keys/values
json_normalize
import requests
import pandas as pd
from pandas.io.json import json_normalize
url = 'https://library.brown.edu/search/solr_pub/iip/?start=0&rows=100&indent=on&wt=json&q=*'
r = requests.get(url).json()
df = json_normalize(r['response']['docs']).T
This is a sample output: each column is a separate item of r['response']['docs'], and the rows are the keys for that item's values.
The data is fairly messy, and there are some keys you probably don't want at all. That's why I think it's better to extract specific keys and values.
df = pd.DataFrame()
query = r['response']['docs']
for i in range(len(query)):
    records = [
        query[i]['inscription_id'], query[i]['city'], query[i]['_version_'],
        query[i]['type'][0], query[i]['language_display'][0], query[i]['physical_type'][0]
    ]
    # recs.append(records)
    data = pd.DataFrame.from_records([records], columns=['inscription_id', 'city', '_version_', 'type', 'language_display', 'physical_type'])
    df = df.append(data)
# sample data
inscription_id city _version_ type language_display physical_type
0 unkn0103 Unknown 1613822149755142144 prayer Hebrew other_object
0 zoor0391 Zoora 1613822173978296320 funerary.epitaph Greek tombstone
0 zoor0369 Zoora 1613822168079007744 funerary.epitaph Greek tomb
0 zoor0378 Zoora 1613822170509606912 funerary.epitaph Greek tombstone
0 zoor0393 Zoora 1613822174648336384 funerary.epitaph Greek tombstone
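As a side note, calling df.append inside a loop rebuilds the DataFrame on every iteration and is deprecated in newer pandas versions; collecting a list of dicts and building the frame once is usually cleaner. Here is a sketch with the same columns as above, reusing r and pd from earlier and assuming every doc contains these keys:
records = []
for doc in r['response']['docs']:
    records.append({
        'inscription_id': doc['inscription_id'],
        'city': doc['city'],
        '_version_': doc['_version_'],
        'type': doc['type'][0],
        'language_display': doc['language_display'][0],
        'physical_type': doc['physical_type'][0],
    })
df = pd.DataFrame(records)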
edit
query=r[][] - what is it and how can I learn more about it?
r = requests.get(url).json()
By using .json() at the end of the requests.get call, I converted the response object (data from the API) to a dict. Hence, r is now a dictionary containing these keys:
r.keys()
# dict_keys(['responseHeader', 'response'])
To access the values of a specific key, I use the format dictionary[key][key_inside_of_another_key]:
r['response'].keys()
# dict_keys(['numFound', 'start', 'docs'])
r['response']['docs'][0]
# sample output
{'inscription_id': 'zoor0353',
'metadata': ['zoor0353',
'Negev',
'Zoora',
'Zoora, Negev. cemetery in An Naq. \nNegev.\n Zoora. Found by local inhabitants in the northwest corner of the Bronze\n Age, Byzantine and Islamic cemetery in the An Naq neighborhood south of\n the Wadi al-Hasa, probably in secondary use in later graves. \n',
'funerary.epitaph',
...
...
...
# [0] here is an item inside of a key - in the case of your data,
# there were 100 such items: [0, 1, .. 99] inside of r['response']['docs']
# to access item 4, you'd use r['response']['docs'][4]
That's how you navigate through a dictionary. Now, to access a specific key, say inscription_id or _version_:
r['response']['docs'][0]['inscription_id']
# 'zoor0353'
r['response']['docs'][99]['inscription_id']
# 'mger0001'
r['response']['docs'][33]['_version_']
# 1613822151126679552
Lastly, to iterate through all rows (items) of the data, I used a for loop: i here is a substitute for a range of numbers representing each item of your data -- from r['response']['docs'][0] to r['response']['docs'][99].
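Since you mentioned building a dataset from many calls to this API: the start and rows parameters in the URL suggest the results can be paged through. Here is a rough sketch of that idea (the exact paging behaviour of this API is an assumption on my part):
import requests
import pandas as pd

base_url = 'https://library.brown.edu/search/solr_pub/iip/'
rows = 100
all_docs = []
for start in range(0, 500, rows):   # first 500 records as a demo; raise the upper bound as needed
    params = {'start': start, 'rows': rows, 'indent': 'on', 'wt': 'json', 'q': '*'}
    page = requests.get(base_url, params=params).json()
    all_docs.extend(page['response']['docs'])

big_df = pd.DataFrame(all_docs)   # or extract specific keys per doc, as shown above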

Related

Iterating over an API response breaks for only one column in a Pandas dataframe

Problem: In my dataframe, when looping through zip codes against a weather API, I am getting the SAME value for the "Desc" column in every row, namely "cloudy" (which is incorrect for some zip codes). I think it is taking the value from the very last zip code in the list and applying it to every row in the Desc column.
But if I run only zip code 32303 and comment out all the other zip codes, the value for "Desc" is correct: it is now listed as sunny/clear, which shows the values are wrong only when looping.
Heck, it's Florida! ;)
Checking other weather sources I know "sunny/clear" is the correct value for 32303, not "cloudy". So for some reason, iterating is breaking on the column Desc only. I've tried so many options and am just stuck. Any ideas how to fix this?
import requests
import pandas as pd

api_key = 'a14ac278e4c4fdfd277a5b37e1dbe87a'

# Create a dictionary of zip codes for the team
zip_codes = {
    55446: "You",
    16823: "My Boo",
    94086: "Your Boo",
    32303: "Mr. Manatee",
    95073: "Me"
}

# Create a list of zip codes
zip_list = list(zip_codes.keys())

# Create a list of names
name_list = list(zip_codes.values())

# For team data, create a pandas DataFrame from the dictionary
df1 = pd.DataFrame(list(zip_codes.items()),
                   columns=['Zip Code', 'Name'])

# Create empty lists to hold the API response data
city_name = []
description = []
weather = []
feels_like = []
wind = []
clouds = []

# Loop through each zip code
for zip_code in zip_list:
    # Make a request to the OpenWeatherMap API
    url = f"http://api.openweathermap.org/data/2.5/weather?zip={zip_code},us&units=imperial&appid={api_key}"
    response = requests.get(url).json()
    # Store the response data in the appropriate empty list
    city_name.append(response['name'])
    description = response['weather'][0]['main']
    weather.append(response['main']['temp'])
    feels_like.append(response['main']['feels_like'])
    wind.append(response['wind']['speed'])
    clouds.append(response['clouds']['all'])
    # rain.append(response['humidity']['value'])

# For weather data, create df from lists
df2 = pd.DataFrame({
    'City': city_name,
    'Desc': description,
    'Temp (F)': weather,
    'Feels like': feels_like,
    'Wind (mph)': wind,
    'Clouds %': clouds,
    # 'Rain (1hr)': rain,
})

# Merge df1 & df2, round decimals, and don't display index or zip.
df3 = pd.concat([df1, df2], axis=1, join='inner').drop('Zip Code', axis=1)
df3[['Temp (F)', 'Feels like', 'Wind (mph)', 'Clouds %']] = df3[['Temp (F)', 'Feels like', 'Wind (mph)', 'Clouds %']].astype(int)

# Don't truncate df
pd.set_option('display.width', 150)

# Print the combined DataFrames
display(df3.style.hide_index())
Example output; note that every "Desc" value is "Clouds", but I know that is not correct, since some should differ.
Name City Desc Temp (F) Feels like Wind (mph) Clouds %
You Minneapolis Clouds 1 -10 12 100
My Boo Bellefonte Clouds 10 -1 15 100
Your Boo Sunnyvale Clouds 54 53 6 75
Mr. Manatee Tallahassee Clouds 49 49 3 0
Me Soquel Clouds 53 52 5 100
For example, if I comment out all the zip codes except for 32303: "Mr. Manatee", then I get a different value:
Name City Desc Temp (F) Feels like Wind (mph) Clouds %
Mr. Manatee Tallahassee Clear 49 49 3 0
To solve this, I tried another approach, below, which DOES give correct values for each zip code. The problem is that several of the columns are json values, and if I can't fix the code above, then I need to parse them and show only the relevant values. But my preference would be to fix the code above!
import requests
import pandas as pd
import json

zip_codes = {
    95073: "Me",
    55446: "You",
    16823: "My Boo",
    94086: "Your Boo",
    32303: "Mr. Manatee"
}

# Create a list of zip codes
zip_list = list(zip_codes.keys())

# Create a list of names
name_list = list(zip_codes.values())

# Create a list of weather data
weather_list = []

# Set the API key
api_key = 'a14ac278e4c4fdfd277a5b37e1dbe87a'

# Get the weather data from the openweather API
for zip_code in zip_list:
    api_url = f'http://api.openweathermap.org/data/2.5/weather?zip={zip_code},us&units=imperial&appid={api_key}'
    response = requests.get(api_url).json()
    weather_list.append(response)

# Create the dataframe
df = pd.DataFrame(weather_list)

# Add the name column
df['Name'] = name_list

# Parse the 'weather' column
# THIS DOESN'T WORK! df.weather.apply(lambda x: x[x]['main'])

# Drop unwanted columns
df.drop(['coord', 'base', 'visibility', 'dt', 'sys', 'timezone', 'cod'], axis=1)
I tried a different approach but got unusable json values. I tried various ways to fix looping in my first approach but I still get the same values for "Desc" instead of unique values corresponding to each zip code.
Like jqurious said, you had a bug in your code:
description = response['weather'][0]['main']
Because you assign to description instead of appending to it, description ends up holding only the value from the final zip code in the dictionary, and that single value is then repeated across the whole dataframe. No wonder they are all the same.
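The minimal fix to your original loop would therefore be to append, as you already do for the other lists:
description.append(response['weather'][0]['main'])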
Since you are collecting data to build a dataframe, it's better to use a list of dictionaries rather than a series of lists:
data = []
for zip_code in zip_list:
    url = f"http://api.openweathermap.org/data/2.5/weather?zip={zip_code},us&units=imperial&appid={api_key}"
    response = requests.get(url).json()
    data.append({
        "City": response["name"],
        "Desc": response["weather"][0]["main"],
        "Temp (F)": response["main"]["temp"],
        "Feels like": response["main"]["feels_like"],
        "Wind (mph)": response["wind"]["speed"],
        "Clouds %": response["clouds"]["all"]
    })
# You don't need to redefine the column names here
df2 = pd.DataFrame(data)
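If you also want the Name column in the same frame, one option (a small sketch reusing your zip_codes dictionary) is to add it inside the loop instead of concatenating with df1 afterwards:
data.append({
    "Name": zip_codes[zip_code],
    "City": response["name"],
    "Desc": response["weather"][0]["main"],
    # ... remaining fields as above
})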

Python data source - first two columns disappear

I have started using PowerBI and am using Python as a data source with the code below. The source data can be downloaded from here (it's about 700 megabytes). The data is originally from here (contained in IOT_2019_pxp.zip).
import pandas as pd
import numpy as np
import os

path = '/path/to/file'

to_chunk = pd.read_csv(os.path.join(path, 'A.txt'), delimiter='\t', header=[0, 1], index_col=[0, 1],
                       iterator=True, chunksize=1000)

def chunker(to_chunk):
    to_concat = []
    for chunk in to_chunk:
        try:
            to_concat.append(chunk['BG'].loc['BG'])
        except:
            pass
    return to_concat

A = pd.concat(chunker(to_chunk))
I = np.identity(A.shape[0])
L = pd.DataFrame(np.linalg.inv(I - A), index=A.index, columns=A.columns)
The code simply:
Loads the file A.txt, which is a symmetrical matrix. This matrix has every sector in every region for both rows and columns; in pandas, these form a MultiIndex.
Filters just the region that I need, which is BG. Since it's a symmetrical matrix, both rows and columns are filtered.
Calculates the inverse of the matrix, giving us L, which is what I want to load into Power BI. This matrix now has just a single regular Index for sector.
This is all well and good; however, when I load it into Power BI, the first column (the sector names for each row, i.e. the DataFrame index) disappears. When the query gets processed, it is as if it were never there. This is true for both dataframes A and L, so it's not an issue with the data processing. The column of row names (the DataFrame index) is still there in Python; Power BI just drops it for some reason.
I need this column so that I can link these tables to other tables in my data model. Any ideas on how to keep it from disappearing at load time?
For what it's worth, calling reset_index() converted the index into regular columns, and those loaded fine. For whatever reason, Power BI does not properly load pandas indices.
For a regular 1D index, I had to do S.reset_index().
For a MultiIndex, I had to do L.reset_index(inplace=True).
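As a rough sketch of that workaround, assuming A and L are the DataFrames built above, reset the index just before handing the data to Power BI:
A = A.reset_index()   # the two MultiIndex levels become ordinary columns
L = L.reset_index()   # the single sector index becomes an ordinary column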

pandas data types changed when reading from parquet file?

I am brand new to pandas and the parquet file type. I have a python script that:
reads in a hdfs parquet file
converts it to a pandas dataframe
loops through specific columns and changes some values
writes the dataframe back to a parquet file
Then the parquet file is imported back into hdfs using impala-shell.
The issue I'm having appears to be with step 2. I have it print out the contents of the dataframe immediately after it reads it in and before any changes are made in step 3. It appears to be changing the datatypes and the data of some fields, which causes problems when it writes it back to a parquet file. Examples:
fields that show up as NULL in the database are replaced with the string "None" (for string columns) or the string "nan" (for numeric columns) in the printout of the dataframe.
fields that should be an Int with a value of 0 in the database are changed to "0.00000" and turned into a float in the dataframe.
It appears that it is actually changing these values, because when it writes the parquet file and I import it into hdfs and run a query, I get errors like this:
WARNINGS: File '<path>/test.parquet' has an incompatible Parquet schema for column
'<database>.<table>.tport'. Column type: INT, Parquet schema:
optional double tport [i:1 d:1 r:0]
I don't know why it would alter the data and not just leave it as-is. If this is what's happening, I don't know if I need to loop over every column and replace all these back to their original values, or if there is some other way to tell it to leave them alone.
I have been using this reference page:
http://arrow.apache.org/docs/python/parquet.html
It uses
pq.read_table(in_file)
to read the parquet file and then
df = table2.to_pandas()
to convert to a dataframe that I can loop through and change the columns. I don't understand why it's changing the data, and I can't find a way to prevent this from happening. Is there a different way I need to read it than read_table?
If I query the database, the data would look like this:
tport
0
1
My print(df) line for the same thing looks like this:
tport
0.00000
nan
nan
1.00000
Here is the relevant code. I left out the part that processes the command-line arguments since it was long and it doesn't apply to this problem. The file passed in is in_file:
import sys, getopt
import random
import re
import math
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import os.path

# <CLI PROCESSING SECTION HERE>

# GET LIST OF COLUMNS THAT MUST BE SCRAMBLED
field_file = open('scrambler_columns.txt', 'r')
contents = field_file.read()
scrambler_columns = contents.split('\n')

def scramble_str(xstr):
    #print(xstr + '_scrambled!')
    return xstr + '_scrambled!'

parquet_file = pq.ParquetFile(in_file)
table2 = pq.read_table(in_file)
metadata = pq.read_metadata(in_file)
df = table2.to_pandas()  # dataframe

print('rows: ' + str(df.shape[0]))
print('cols: ' + str(df.shape[1]))

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
#df.fillna(value='', inplace=True)  # np.nan # \xa0

print(df)  # print before making any changes

cols = list(df)

# https://pythonbasics.org/pandas-iterate-dataframe/
for col_name, col_data in df.iteritems():
    #print(cols[index])
    if col_name in scrambler_columns:
        print('scrambling values in column ' + col_name)
        for i, val in col_data.items():
            df.at[i, col_name] = scramble_str(str(val))

print(df)  # print after making changes
print(parquet_file.num_row_groups)
print(parquet_file.read_row_group(0))

# WRITE NEW PARQUET FILE
new_table = pa.Table.from_pandas(df)
writer = pq.ParquetWriter(out_file, new_table.schema)
for i in range(1):
    writer.write_table(new_table)
writer.close()

if os.path.isfile(out_file) == True:
    print('wrote ' + out_file)
else:
    print('error writing file ' + out_file)

# READ NEW PARQUET FILE
table3 = pq.read_table(out_file)
df = table3.to_pandas()  # dataframe
print(df)
EDIT
Here are the datatypes for the first few columns in hdfs, and here are the same ones in the pandas dataframe:
id object
col1 float64
col2 object
col3 object
col4 float64
col5 object
col6 object
col7 object
It appears to convert
String to object
Int to float64
bigint to float64
How can I tell pandas what data types the columns should be?
Edit 2: I was able to find a workaround by directly processing the pyarrow tables. Please see my question and answers here: How to update data in pyarrow table?
fields that show up as NULL in the database are replaced with the string "None" (for string columns) or the string "nan" (for numeric columns) in the printout of the dataframe.
This is expected; it's just how pandas prints missing values (None in object columns, NaN/nan in float columns). The underlying values are still nulls, not literal strings.
It appears to convert String to object
This is also expected. Numpy/pandas does not have a dtype for variable length strings. It's possible to use a fixed-length string type but that would be pretty unusual.
It appears to convert Int to float64
This is also expected, since the column has nulls and numpy's int64 is not nullable. If you would like to use pandas' nullable integer column type, you can do...
def lookup(t):
    if pa.types.is_integer(t):
        return pd.Int64Dtype()

df = table.to_pandas(types_mapper=lookup)
Of course, you could create a more fine-grained lookup if you wanted to use both Int32Dtype and Int64Dtype; this is just a template to get you started.
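For example, a slightly more fine-grained mapper might look like the sketch below (adjust the branches to the types you actually have; table2 is the table read in your script):
import pyarrow as pa
import pandas as pd

def lookup(t):
    # return a pandas extension dtype for the pyarrow types we care about;
    # returning None falls back to the default conversion
    if pa.types.is_int32(t):
        return pd.Int32Dtype()
    if pa.types.is_int64(t):
        return pd.Int64Dtype()
    if pa.types.is_string(t):
        return pd.StringDtype()
    return None

df = table2.to_pandas(types_mapper=lookup)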

Python3 - Return CSV with row-level errors for missing data

New to Python. I'm importing a CSV; if any data is missing, I need to return a CSV with an additional column indicating which rows are missing data. A colleague suggested that I import the CSV into a dataframe, create a new dataframe with a "Comments" column, fill it with a comment on the intended rows, and append it to the original dataframe. I'm stuck at the step of filling my new dataframe, "dferr", with the correct number of rows to match up with "dfinput".
I have Googled "pandas csv return error column where data is missing", but haven't found anything related to creating a new CSV that marks bad rows. I don't even know if the proposed way is the best way to go about this.
import pandas as pd

dfinput = None
try:
    dfinput = pd.read_csv(r"C:\file.csv")
except:
    print("Uh oh!")

if dfinput is None:
    print("Ack!")
    quit(10)

dfinput.reset_index(level=None, drop=False, inplace=True, col_level=0,
                    col_fill='')

dferr = pd.DataFrame(columns=['comment'])
print("Empty DataFrame", dferr, sep='\n')
Expected results: "dferr" would have an index column with number of rows equal to "dfinput", and comments on the correct rows where "dfinput" has missing values.
Actual results: "dferr" is empty.
My understanding of 'missing data' here would be null values. It seems that for every row, you want the names of null fields.
df = pd.DataFrame([[1, 2, 3],
                   [4, None, 6],
                   [None, 8, None]],
                  columns=['foo', 'bar', 'baz'])

# Create a dataframe of True/False, True where a criterion is met
# (in this case, a null value)
nulls = df.isnull()

# Iterate through every row of *nulls*,
# and extract the column names where the value is True by boolean indexing
colnames = nulls.columns
null_labels = nulls.apply(lambda s: colnames[s], axis=1)

# Now you have a pd.Series where every entry is an array
# (technically, a pd.Index object)
# Pandas arrays have a vectorized .str.join method:
df['nullcols'] = null_labels.str.join(', ')
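With the sample frame above, the result looks roughly like this:
   foo  bar  baz  nullcols
0  1.0  2.0  3.0
1  4.0  NaN  6.0       bar
2  NaN  8.0  NaN  foo, baz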
The .apply() method in pandas can sometimes be a bottleneck in your code; there are ways to avoid using this, but here it seemed to be the simplest solution I could think of.
EDIT: Here's an alternate one-liner (instead of using .apply) that might cut down computation time slightly:
import numpy as np
df['nullcols'] = [colnames[x] for x in nulls.values]
This might be even faster (a bit more work is required):
np.where(df.isnull(),df.columns,'')
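To tie this back to your original goal, here is a rough sketch (assuming dfinput from your code; the output path is just a placeholder) that builds the comment column from the np.where result and writes the flagged CSV:
import numpy as np

labels = np.where(dfinput.isnull(), dfinput.columns, '')
dfinput['comment'] = [', '.join(filter(None, row)) for row in labels]
dfinput.to_csv(r"C:\file_with_errors.csv", index=False)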

Round pandas timestamp series to seconds - then save to csv without ms/ns resolution

I have a dataframe df with a pd.DatetimeIndex index. The individual timestamps are rounded from 2017-12-04 08:42:12.173645000 to 2017-12-04 08:42:12 using the excellent pandas rounding command:
df.index = df.index.round("S")
When stored to csv, this format is kept (which is exactly what I want). I also need a date-only column, and this is now easily created:
df = df.assign(DateTimeDay = df.index.round("D"))
When stored to a csv file using df.to_csv(), this column does get written out as the full timestamp (2017-12-04 00:00:00), except when it is the ONLY column to be saved. So, I add the following command before saving:
df["DateTimeDay"] = df["DateTimeDay"].dt.date
...and the csv-file looks nice again (2017-12-04)
Problem description
Now over to the question: I have two other columns with timestamps in the same format as above (but with different values, and with a very few NaNs). I also want to round these to seconds (keeping NaNs as NaNs, of course), and then make sure that when written to csv they are not padded with zeros below the second resolution. Whatever I try, I am simply not able to do this.
Additional information:
print(df.dtypes)
print(df.index.dtype)
...all results in datetime64[ns]. If I convert them to an index:
df["TimeCol2"] = pd.DatetimeIndex(df["TimeCol2"]).round("s")
df["TimeCol3"] = pd.DatetimeIndex(df["TimeCol3"]).round("s")
...it works, but the csv-file still pads them with unwanted and unnecessary zeros.
Optimal solution: No conversion of the columns (like above) or use of element-wise apply unless they are quick (100+ million rows). My dream command would be like this:
df["TimeCol2"] = df["TimeCol2"].round("s") # Raises TypeError: an integer is required (got type str)
You can specify the date format for datetime dtypes when calling to_csv:
In[170]:
df = pd.DataFrame({'date':[pd.to_datetime('2017-12-04 07:05:06.767')]})
df
Out[170]:
date
0 2017-12-04 07:05:06.767
In[171]:
df.to_csv(date_format='%Y-%m-%d %H:%M:%S')
Out[171]: ',date\n0,2017-12-04 07:05:06\n'
If you want to round the values, you need to round prior to writing to csv:
In[173]:
df1 = df['date'].dt.round('s')
df1.to_csv(date_format='%Y-%m-%d %H:%M:%S')
Out[173]: '0,2017-12-04 07:05:07\n'
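Applied to the columns from the question, that combination would look roughly like this (TimeCol2/TimeCol3 come from the question, the output file name is a placeholder, and .dt.round leaves NaT values untouched):
df["TimeCol2"] = df["TimeCol2"].dt.round("s")   # NaT stays NaT
df["TimeCol3"] = df["TimeCol3"].dt.round("s")
df.to_csv("out.csv", date_format="%Y-%m-%d %H:%M:%S")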
