Pandas - iterating to fill values of a dataframe - python-3.x

I'm trying to build a DataFrame of time-series data. I have to retrieve the data from an API, and every (i, j) entry in the DataFrame (where "i" is the row and "j" is the column) has to be iterated through and filled individually.
Here's an idea of the type of thing I'm trying to do (note that the API I'm using doesn't have historical data for what I'm trying to analyze):
import pandas as pd
import numpy as np
import time

def retrievedata(label):
    # take the label, do some stuff with the API, return a float
    ...

label_list = ['label1', 'label2', 'label3']  # etc.
discrete_points = 720
df = pd.DataFrame(index=np.arange(0, discrete_points), columns=label_list)
So at this point I've pre-allocated a data frame. What comes next is the issue.
Now I want to iterate over it and assign values to every (i, j) entry in the DataFrame based on the function I wrote to pull data. Note that the function has to be specific to a certain column (it takes the column label as input), and on top of that each row will have different values because it is time-series data.
EDIT: Yuck, I found a gross way to make it work:
for row in range(discrete_points):
    for label in label_list:
        df.at[row, label] = retrievedata(label)
This is obviously a non-Pythonic, non-NumPy, non-pandas way of doing things, so I'd like to find a nicer, more efficient, less compute-intensive way of doing this.
I'm assuming it's going to be some combination of: df.iterrows(), df.itertuples(), df.loc[], df.at[]
I'm stumped though.
Any ideas?
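For what it's worth, since each cell requires its own API call, the per-call latency will dominate whatever iteration pattern you choose. One pattern that at least avoids cell-by-cell .at writes (a sketch only, assuming retrievedata must be called once per (row, label) pair) is to build the data first and construct the DataFrame once:
# Sketch: collect each column as a list, then build the frame in one shot
data = {label: [retrievedata(label) for _ in range(discrete_points)]
        for label in label_list}
df = pd.DataFrame(data)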

Related

Plotting graphs with Altair from a Pandas Dataframe

I am trying to read table values from a spreadsheet and plot different charts using Altair.
The spreadsheet can be found here
import pandas as pd
xls_file = pd.ExcelFile('PET_PRI_SPT_S1_D.xls')
xls_file
crude_df = xls_file.parse('Data 1')
crude_df
I am setting the second row values as column headers of the data frame.
crude_df.columns = crude_df.iloc[1]
crude_df.columns
Index(['Date', 'Cushing, OK WTI Spot Price FOB (Dollars per Barrel)',
       'Europe Brent Spot Price FOB (Dollars per Barrel)'],
      dtype='object', name=1)
The following is a modified version of Altair code taken from the documentation examples:
crude_df_header = crude_df.head(100)
import altair as alt

alt.Chart(crude_df_header).mark_circle().encode(
    # Mapping the WTI column to the y-axis
    y='Cushing, OK WTI Spot Price FOB (Dollars per Barrel)'
)
This does not work.
The error shown is:
TypeError: Object of type datetime is not JSON serializable
How do I make 2-D plots with this data?
Also, how can I make plots when the number of values exceeds 5000 in Altair? Even this results in errors.
Your error is due to the way you parsed the file. You set the column names but forgot to remove the first two rows, including the one that now serves as the header. The presence of these string values in the data caused the error.
The proper way of achieving what you are looking for is as follows:
import pandas as pd
import altair as alt
crude_df = pd.read_excel(open('PET_PRI_SPT_S1_D.xls', 'rb'),
                         sheet_name='Data 1', index_col=None, header=2)

alt.Chart(crude_df.head(100)).mark_circle().encode(
    x='Date',
    y='Cushing, OK WTI Spot Price FOB (Dollars per Barrel)'
)
For the max-rows issue, you can use the following:
alt.data_transformers.disable_max_rows()
But be mindful of the official warning:
If you choose this route, please be careful: if you are making multiple plots with the dataset in a particular notebook, the notebook will grow very large and performance may suffer.
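If embedding the data is the concern, the Altair documentation also describes a 'json' data transformer that writes the data to a file rather than embedding it in the chart spec (a sketch; it assumes the notebook is served somewhere the data file is reachable):
import altair as alt
alt.data_transformers.enable('json')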

Nested loops altering rows in pandas - Avoiding "A value is trying to be set on a copy of a slice from a DataFrame"

Summary
I am trying to loop through a pandas dataframe, and to run a secondary loop at each iteration. The secondary loop calculates something that I want to append into the original dataframe, so that when the primary loop advances, some of the rows are recalculated based on the changed values. (For those interested, this is a simple advective model of carbon accumulation in soils. When a new layer of soil is deposited, mixing processes penetrate into older layers and transform their properties to a set depth. Thus, each layer deposited changes those below it incrementally, until a former layer lies below the mixing depth.)
I have produced an example of how I want this to work, however it is generating the common error message:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
I have looked into the information linked in the error message, as well as myriad posts on this forum, but none address continually looping through a changing dataframe.
What I've tried, and some possible solutions
Below is some example code. This code works more or less as well as I want it to. But it produces the warning. Should I:
Suppress the warning and continue working with this architecture? In this case, am I asking for trouble with un-reproducible results?
Try a different architecture altogether, like a numpy array from the original dataframe?
Try df.append() or df.copy() to avoid the warning?
I have tried `df.copy()` to no avail - the warning was still thrown.
Example code:
import pandas as pd

a = pd.DataFrame(
    {
        'a': [x/2 for x in range(1, 11)],
        'b': ['hot dog', 'slider', 'watermelon', 'funnel cake', 'cotton candy', 'lemonade', 'fried oreo', 'ice cream', 'corn', 'sausage'],
        'c': ['meat', 'meat', 'vegan', 'vegan', 'vegan', 'vegan', 'dairy', 'dairy', 'vegan', 'meat']
    }
)
print(a)

z = [x/(x+2) for x in range(1, 5)]
print(z)

# Primary loop through rows of the main dataframe
for ind, row in a.iterrows():
    # Pull out a chunk of the dataframe. This is the portion that will be modified.
    # What is below this is already modified and locked into the geological record.
    # What is above has not yet been deposited.
    b = a.iloc[ind:(ind + len(z)), :]
    # Define the size of the secondary loop. Taking the minimum avoids the model
    # mixing below the boundary layer (key error).
    loop = min([len(z), len(b)])
    # Now loop through the sub-dataframe and change accordingly.
    for fraction in range(loop):
        b['a'].iloc[fraction] = b['a'].iloc[fraction] * z[fraction]
    # Append the original dataframe with the new data:
    a.iloc[ind:(ind + loop), :] = b
    # Tried b.copy() as well, but the warning was still thrown:
    # a.iloc[ind:(ind+loop), :] = b.copy()

print(a)
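One way to avoid the warning entirely (a sketch, not a tested rewrite of the model) is to skip the intermediate slice b and write directly into the original frame with positional indexing, so no temporary copy is ever modified:
# Sketch: write straight into `a` so no slice is ever assigned to
col = a.columns.get_loc('a')
for ind in range(len(a)):
    loop = min(len(z), len(a) - ind)
    for fraction in range(loop):
        a.iloc[ind + fraction, col] *= z[fraction]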

Accessing columns in a csv using python

I am trying to access data from a CSV using Python. I am able to access entire columns of data values; however, I also want to access rows, using an indexed coordinate system, with (0, 1) being column 0, row 1. So far I have this:
# Lukas Robin
# 25.07.2021
import csv

with open("sun_data.csv") as sun_data:
    sunData = csv.reader(sun_data, delimiter=',')
    for data in sunData:
        print(data)
I don't normally use data tables or CSV, so this is a new area for me.
As mentioned in the comment, you could make the jump to pandas and spend a little time learning it. That would be a good investment if you plan to do much data analysis or work with data tables regularly.
If you just want to pull in a table of numbers and access it as you describe, you are perfectly fine using the csv package. Below is an example.
If your .csv file has a header row, you can simply add next(sun_data) before starting the loop to advance the iterator and let that row fall on the floor...
import csv

f_in = 'data_table.csv'
data = []  # a container to hold the results

with open(f_in, 'r') as source:
    sun_data = csv.reader(source, delimiter=',')
    for row in sun_data:
        # convert the read-in values to float data types (or ints or ...)
        row = [float(t) for t in row]
        # append it to the data table
        data.append(row)

print(data[1][0])
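For comparison, the same positional access with pandas is short (a sketch, assuming the file has no header row):
import pandas as pd

df = pd.read_csv('data_table.csv', header=None)
print(df.iat[1, 0])  # row 1, column 0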

Slow loop aggregating rows and columns

I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique entries like those in the list shown above.
To save the overhead of appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it takes hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data-reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
# consolidate specialty fields into dict-like sets (to remove redundant codes);
# output one row per user to a new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()

for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]  # capture 1st row
        for row in range(1, df_user.shape[0]):  # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis=1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserNbr rows:
df_out = pd.DataFrame(out_rows, columns=['UserNbr', 'Spclty'])
# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
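For readers who land here later, a pandas-only groupby sketch of the intended aggregation (untested against the original data, and assuming 'Spclty' holds lists of [code, date] pairs as shown above):
# Sketch: one dict per UserNbr, merging all of that user's [code, date] pairs
df_out = (df_tmp.groupby('UserNbr')['Spclty']
          .apply(lambda lists: dict(pair for lst in lists for pair in lst))
          .reset_index())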

Adding numerical values from dict to a new column in a Pandas DataFrame

I am practicing machine learning and working with a movie/rating dataset. I am trying to create a new column in the dataframe which numerically identifies each genre (around 1,300 of them). My logic was to create a dictionary of the unique genres, labelling each with an integer, and then write a for loop that iterates through each row of the dataframe, checks its genre, and assigns the appropriate value to a new column named "genre_id". However, this has been causing an infinite loop which I cannot even break with Ctrl-C. I hit the same issue in Jupyter (Interrupt Kernel fails to stop it). Below is a summarized version of my approach.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
movies_data = pd.read_csv("C://mypython/moviedata/movies.csv")
ratings_data = pd.read_csv("C://mypython/moviedata/ratings.csv")
joined = pd.merge(movies_data, ratings_data, how='inner', on=['movieId'])
print(joined.head())

pd.options.display.float_format = '{:,.2f}'.format

genres = joined['genres'].unique()
genre_dict = {}
Id = 1
for i in genres:
    genre_dict[i] = Id
    Id += 1

joined['genre_id'] = 0
increment = 0
for i in joined['genres']:
    if i in genre_dict:
        joined['genre_id'][increment] = genre_dict[i]
    increment += 1
I know I should probably be taking a smaller sample to work with, as there are about 20,000,000 rows in the dataset, but I figured I'd try this as an exercise.
I also receive the SettingWithCopyWarning, though this hasn't caused me issues in the past on other projects. Any thoughts on how to do this would be greatly appreciated.
EDIT: Found a solution using the Series map feature.
joined['genre_id'] = joined.genres.map(genre_dict)
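For reference, pandas can also generate the integer codes directly; a one-line sketch (not the poster's code, and note the codes start at 0 rather than 1):
joined['genre_id'], genre_labels = pd.factorize(joined['genres'])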
I don't have enough reputation to comment, so I'll post this as a suggestion instead: the standard procedure for handling categorical values in a dataset is to use the built-in sklearn.preprocessing.OneHotEncoder, which does the work you want to do.
For a better understanding, with examples, check One Hot Encode Sequence Data in Python. Let me know if this works for you.
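A minimal sketch of that suggestion (the 'genres' column name is taken from the question; this is not the answerer's exact code):
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
# fit_transform expects a 2-D input, hence the double brackets
genre_matrix = encoder.fit_transform(joined[['genres']])  # sparse one-hot matrix
print(encoder.categories_[0][:5])  # first few genre labels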
