Nested loops altering rows in pandas - Avoiding "A value is trying to be set on a copy of a slice from a DataFrame" - python-3.x

Summary
I am trying to loop through a pandas dataframe, and to run a secondary loop at each iteration. The secondary loop calculates something that I want to append into the original dataframe, so that when the primary loop advances, some of the rows are recalculated based on the changed values. (For those interested, this is a simple advective model of carbon accumulation in soils. When a new layer of soil is deposited, mixing processes penetrate into older layers and transform their properties to a set depth. Thus, each layer deposited changes those below it incrementally, until a former layer lies below the mixing depth.)
I have produced an example of how I want this to work, however it is generating the common error message:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
I have looked into the linked information in the error message as well as myriad posts on this forum, but none get into the continual looping through a changed dataframe.
What I've tried, and some possible solutions
Below is some example code. This code works more or less as well as I want it to. But it produces the warning. Should I:
- Suppress the warning and continue working with this architecture? In this case, am I asking for trouble with un-reproducible results?
- Try a different architecture altogether, like a numpy array from the original dataframe?
- Try df.append() or df.copy() to avoid the warning?
I have tried df.copy() to no avail: the warning was still thrown.
Example code:
import pandas as pd
a = pd.DataFrame(
    {
        'a':[x/2 for x in range(1,11)],
        'b':['hot dog', 'slider', 'watermelon', 'funnel cake', 'cotton candy', 'lemonade', 'fried oreo', 'ice cream', 'corn', 'sausage'],
        'c':['meat', 'meat', 'vegan', 'vegan', 'vegan', 'vegan', 'dairy','dairy', 'vegan', 'meat']
    }
)
print(a)
z = [x/(x+2) for x in range(1,5)]
print(z)
#Primary loop through rows of the main dataframe
for ind, row in a.iterrows():
    #Pull out a chunk of the dataframe. This is the portion of the dataframe that will be modified. What is below
    #this is already modified and locked into the geological record. What is above has not yet been deposited.
    b = a.iloc[ind:(ind+len(z)), :]
    #Define the size of the secondary loop. Taking the minimum avoids the model mixing below the boundary layer (key error)
    loop = min([len(z), len(b)])
    #Now loop through the sub-dataframe and change accordingly.
    for fraction in range(loop):
        b['a'].iloc[fraction] = b['a'].iloc[fraction]*z[fraction]
    #Append the original dataframe with new data:
    a.iloc[ind:(ind+loop), :] = b
    #Try df.copy(), but still throws warning!
    #a.iloc[ind:(ind+loop), :] = b.copy()
print(a)
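One possible way to avoid the warning, offered as a sketch rather than a definitive fix: the warning most likely comes from the chained assignment b['a'].iloc[fraction] = ..., which writes into a slice of a. Writing straight into the original frame with a single .iloc call sidesteps the intermediate slice. The function name mix_layers and its parameters are made up for this illustration, and only column 'a' from the example is reproduced.

```python
import numpy as np
import pandas as pd


def mix_layers(df, fractions, column="a"):
    """Return a copy of df with `column` rescaled layer by layer.

    mix_layers, fractions and column are illustrative names, not part of
    the original post.
    """
    out = df.copy()                      # leave the caller's frame untouched
    col = out.columns.get_loc(column)    # integer position of the column
    fractions = np.asarray(fractions, dtype=float)
    for start in range(len(out)):
        # Clip the window at the bottom of the frame (the "boundary layer").
        stop = min(start + len(fractions), len(out))
        n = stop - start
        # One .iloc call with row and column positions: no chained indexing,
        # so pandas assigns into `out` itself rather than a temporary slice.
        out.iloc[start:stop, col] = out.iloc[start:stop, col].to_numpy() * fractions[:n]
    return out


a = pd.DataFrame({"a": [x / 2 for x in range(1, 11)]})
z = [x / (x + 2) for x in range(1, 5)]
result = mix_layers(a, z)
print(result)
```

Because each window is multiplied as a whole array, the inner loop over fraction is no longer needed; the outer loop is kept because each step depends on the values changed by the previous one.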

Related

Stuck using pandas to build RPG item generator

I am trying to build a simple random item generator for a game I am working on.
So far I am stuck trying to figure out how to store and access all of the data. I went with pandas using .csv files to store the data sets.
I want to add weighted probabilities to what items are generated so I tried to read the csv files and compile each list into a new set.
I got the program to pick a random set but got stuck when trying to pull a random row from that set.
I am getting an error when I use .sample() to pull the item row which makes me think I don't understand how pandas works. I think I need to be creating new lists so I can later index and access the various statistics of the items once one is selected.
Once I pull the item I was intending on adding effects that would change the damage and armor and such displayed. So I was thinking of having the new item be its own list then use damage = item[2] + 3 or whatever I need
error is: AttributeError: 'list' object has no attribute 'sample'
Can anyone help with this problem? Maybe there is a better way to set up the data?
here is my code so far:
import pandas as pd
import random
df = [pd.read_csv('weapons.csv'), pd.read_csv('armor.csv'), pd.read_csv('aether_infused.csv')]
def get_item():
    item_class = [random.choices(df, weights=(45,40,15), k=1)] #this part seemed to work. When I printed item_class it printed one of the entire lists at the correct odds
    item = item_class.sample()
    print (item) #to see if the program is working
get_item()
I think you are getting slightly confused between lists and list elements. This should work. I stubbed your dfs with simple ones.
import pandas as pd
import random
# Actual data. Comment it out if you do not have the csv files
df = [pd.read_csv('weapons.csv'), pd.read_csv('armor.csv'), pd.read_csv('aether_infused.csv')]
# My stubs -- uncomment and use this instead of the line above if you want to run this specific example
# df = [pd.DataFrame({'weapons' : ['w1','w2']}), pd.DataFrame({'armor' : ['a1','a2', 'a3']}), pd.DataFrame({'aether' : ['e1','e2', 'e3', 'e4']})]
def get_item():
    # I removed [] from the line below -- choices() already returns a list of length 1
    item_class = random.choices(df, weights=(45,40,15), k=1)
    # I added [0] to choose the first element of item_class which is a list of length 1 from the line above
    item = item_class[0].sample()
    print (item) #to see if the program is working
get_item()
This prints a random row from a randomly chosen dataframe. With the stubs I set up, the output looks like, for example:
  weapons
1      w2
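To get from the sampled row to individual statistics (the damage = item[2] + 3 idea in the question), one option is to turn the single sampled row into a plain dict and look values up by column name. This is only a sketch: the tables, the column names name, damage and armor, and the +3 bonus are invented here and will differ from the real csv files.

```python
import random

import pandas as pd

# Hypothetical stand-ins for weapons.csv, armor.csv and aether_infused.csv.
tables = [
    pd.DataFrame({"name": ["sword", "axe"], "damage": [5, 7]}),
    pd.DataFrame({"name": ["helm", "shield"], "armor": [2, 4]}),
    pd.DataFrame({"name": ["aether blade"], "damage": [9]}),
]


def get_item():
    # choices() returns a list of length 1, so take its first element.
    table = random.choices(tables, weights=(45, 40, 15), k=1)[0]
    # sample(n=1) returns a one-row DataFrame; .iloc[0] turns it into a Series,
    # and to_dict() gives a plain {column: value} mapping.
    return table.sample(n=1).iloc[0].to_dict()


item = get_item()
if "damage" in item:
    item["damage"] += 3  # example effect applied after the item is rolled
print(item)
```

Looking values up by column name instead of by position keeps the code working if the column order in the csv files changes.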

Why is my notebook crashing when I run this for loop and what is the fix?

I have taken code in relation to the Kalman Filter and am attempting to iterate through each column of data. What I would like to have happen is:
- The column data is fed into the filter
- The filtered column data (xhat) is placed into another DataFrame (filtered)
- The filtered column data (xhat) is used to produce a visual.
I have created a for loop to iterate through the column data, but when I run the cell, I crash the notebook. When it doesn't crash, I get this warning:
C:\Users\perso\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:45: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
Thanks in advance for any help. I hope this question is detailed enough. I bombed on the last one.
'''A Python implementation of the example given in pages 11-15 of "An
Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
University of North Carolina at Chapel Hill, Department of Computer
Science, TR 95-041,
https://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf'''
# by Andrew D. Straw
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# dataframe created to hold filtered data
filtered = pd.DataFrame()
# initial parameters
for column in data:
    n_iter = len(data.index) #number of iterations equal to sample numbers
    sz = (n_iter,) # size of array
    z = data[column] # observations
    Q = 1e-5 # process variance
    # allocate space for arrays
    xhat=np.zeros(sz) # a posteriori estimate of x
    P=np.zeros(sz) # a posteriori error estimate
    xhatminus=np.zeros(sz) # a priori estimate of x
    Pminus=np.zeros(sz) # a priori error estimate
    K=np.zeros(sz) # gain or blending factor
    R = 1.0**2 # estimate of measurement variance, change to see effect
    # initial guesses
    xhat[0] = z[0]
    P[0] = 1.0
    for k in range(1,n_iter):
        # time update
        xhatminus[k] = xhat[k-1]
        Pminus[k] = P[k-1]+Q
        # measurement update
        K[k] = Pminus[k]/( Pminus[k]+R )
        xhat[k] = xhatminus[k]+K[k]*(z[k]-xhatminus[k])
        P[k] = (1-K[k])*Pminus[k]
    # add new data to created dataframe
    filtered.assign(a = [xhat])
    #create visualization of noise reduction
    plt.rcParams['figure.figsize'] = (10, 8)
    plt.figure()
    plt.plot(z,'k+',label='noisy measurements')
    plt.plot(xhat,'b-',label='a posteriori estimate')
    plt.legend()
    plt.title('Estimate vs. iteration step', fontweight='bold')
    plt.xlabel('column data')
    plt.ylabel('Measurement')
This seems like a pretty straightforward error. The warning indicates that you have opened more figures than the threshold at which matplotlib warns (a parameter you can change, set to 20 by default). This is because in each iteration of your for loop, you create a new figure. Depending on how many times the loop runs, you are opening potentially hundreds or thousands of figures. Each of these figures takes resources to generate and show, so you are putting a very large load on your system, which then either runs very slowly or crashes altogether. In any case, the solution is to plot fewer figures.
I don't know exactly what you're plotting in your loop but it seems like each iteration of your loop corresponds to one time step and at each time step you'd like to plot the estimated and actual values. In this case, you need to define a figure and figure options once, outside of the loop, rather than at each iteration. But a better way to do this is probably to generate all of the data you want to plot ahead of time and store it in an easy-to-plot datatype like lists, then plot it once at the end.
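A minimal sketch of that approach, under some assumptions: the filter from the question is wrapped in a helper (kalman_1d is a name invented here), data is replaced by random stand-in columns, and the Agg backend is selected only so the snippet runs without a display (drop that line in a notebook). All filtered columns are computed first, then drawn on one figure with one subplot per column, and the figure is closed afterwards.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for this sketch; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def kalman_1d(z, Q=1e-5, R=1.0):
    """Scalar Kalman filter following the loop in the question (hypothetical helper)."""
    z = np.asarray(z, dtype=float)
    xhat = np.zeros(len(z))  # a posteriori estimate of x
    P = np.zeros(len(z))     # a posteriori error estimate
    xhat[0] = z[0]
    P[0] = 1.0
    for k in range(1, len(z)):
        xhatminus = xhat[k - 1]            # time update
        Pminus = P[k - 1] + Q
        K = Pminus / (Pminus + R)          # measurement update
        xhat[k] = xhatminus + K * (z[k] - xhatminus)
        P[k] = (1 - K) * Pminus
    return xhat


# Stand-in for the asker's `data`: three noisy columns (made-up values).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(-0.4, 0.1, size=(50, 3)), columns=["s1", "s2", "s3"])

# Filter every column first and collect the results in one DataFrame.
filtered = pd.DataFrame(
    {col: kalman_1d(data[col].to_numpy()) for col in data}, index=data.index
)

# One figure for all columns instead of one new figure per loop iteration.
fig, axes = plt.subplots(len(data.columns), 1, figsize=(10, 8), sharex=True, squeeze=False)
for ax, col in zip(axes[:, 0], data.columns):
    ax.plot(data[col].to_numpy(), "k+", label="noisy measurements")
    ax.plot(filtered[col].to_numpy(), "b-", label="a posteriori estimate")
    ax.set_title(col)
    ax.legend()
plt.close(fig)  # release the figure once it has been shown or saved
```

Building filtered from a dict also avoids the filtered.assign(...) call in the question, whose return value is discarded, so the original filtered frame stays empty.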

Dask processing of uneven dataframes with fuzzywuzzy

I'm attempting to merge two large dataframes (one 50k+ values, and another with 650k+ values - pared down from 7M+). Merging/matching is being done via fuzzywuzzy, to find which string in the first dataframe matches which string in the other most closely.
At the moment, it takes about 3 minutes to test 100 rows for variables. Consequently, I'm attempting to institute Dask to help with the processing speed. In doing so, Dask returns the following error: 'NotImplementedError: Series getitem in only supported for other series objects with matching partition structure'
Presumably, the error is due to my dataframes not being of equal size. In trying to set a chunksize when converting my pandas dataframes to dask dataframes, I receive an error (TypeError: 'float' object cannot be interpreted as an integer) even though I previously forced all my datatypes in each dataframe to 'objects'. Consequently, I was forced to use the npartitions parameter in the dataframe conversion, which then leads to the 'NotImplementedError' above.
I've tried to standardize the chunksize with partitions with a mathematical index, and also tried using the npartitions parameter. Neither had any effect, and both result in the same NotImplementedError.
As mentioned my efforts to utilize this without Dask have been successful, but far too slow to be useful.
I've also taken a look at these questions/responses:
- Different error
- No solution presented
- Seems promising, but results are still slow
''''
aprices_filtered_ddf = dd.from_pandas(prices_filtered, chunksize = 25e6) #prices_filtered: 404.2MB
all_data_ddf = dd.from_pandas(all_data, chunksize = 25e6) #all_data: 88.7MB
# import dask
client = Client()
dask.config.set(scheduler='processes')
# Define matching function
def match_name(name, list_names, min_score=0):
# -1 score incase we don't get any matches max_score = -1
# Returning empty name for no match as well
max_name = ""
# Iterating over all names in the other
for name2 in list_names:
#Finding fuzzy match score
score = fuzz.token_set_ratio(name, name2)
# Checking if we are above our threshold and have a better score
if (score > min_score) & (score > max_score):
max_name = name2
max_score = score
return (max_name, max_score)
# List for dicts for easy dataframe creation
dict_list = []
# iterating over our players without salaries found above for name in prices_filtered_ddf['ndc_description'][:100]:
# Use our method to find best match, we can set a threshold here
match = client(match_name(name, all_data_ddf['ndc_description_agg'], 80))
# New dict for storing data
dict_ = {}
dict_.update({'ndc_description_agg' : name})
dict_.update({'ndc_description' : match[0]})
dict_.update({'score' : match[1]})
dict_list.append(dict_)
merge_table = pd.DataFrame(dict_list)
# Display results
merge_table
Here's the full error:
NotImplementedError Traceback (most recent call last)
<ipython-input-219-e8d4dcb63d89> in <module>
3 dict_list = []
4 # iterating over our players without salaries found above
----> 5 for name in prices_filtered_ddf['ndc_description'][:100]:
6 # Use our method to find best match, we can set a threshold here
7 match = client(match_name(name, all_data_ddf['ndc_description_agg'], 80))
C:\Anaconda\lib\site-packages\dask\dataframe\core.py in __getitem__(self, key)
2671 return Series(graph, name, self._meta, self.divisions)
2672 raise NotImplementedError(
-> 2673 "Series getitem in only supported for other series objects "
2674 "with matching partition structure"
2675 )
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
''''
I expect that the merge_table will return, in a relatively short time, a dataframe with data for each of the update columns. At the moment, it's extremely slow.
I'm afraid there are a number of problems with this question, so after pointing these out, I can only provide some general guidance.
- The traceback shown is clearly not produced by the code above
- The indentation and syntax are broken
- A distributed client is made, then config set not to use it ("processes" is not the distributed scheduler)
- The client object is called, client(...), but it is not callable, this shouldn't work at all
- The main processing function, match_name is called directly; how do you expect Dask to intervene?
- You don't ever call compute(), so in the code given, I'm not sure Dask is invoked at all.
What you actually want to do:
- Load your smaller, reference dataframe using pandas, and call client.scatter to make sure all the workers have it
- Load your main data with dd.read_csv
- Call df.map_partitions(..) to process your data, where the function you pass should take two pandas dataframes, and work row-by-row.

How to save tuples output from for loop to DataFrame Python

I have a data set of 33k rows x 57 columns.
Some columns contain data that I want to translate with a dictionary.
I have done the translation, but now I want to write the translated data back to my data set.
My problem is saving the tuples produced by the for loop.
I am using tuples to build the translation. .join and .append do not work in my case. I have tried many approaches without success.
Any advice is welcome.
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
for index, row in data.iterrows():
    row["translated"] = (tuple(slownik.get(znak) for znak in row["1st_service"]))
I just want print(data["1st_service"]) to show the translated data, not the data from before the for loop.
First of all, if your csv doesn't already have a 'translated' column, you'll have to add it:
import numpy as np
data['translated'] = np.nan
The problem is that the row object yielded by iterrows() is a copy of the row, not the dataframe itself, so writing to it does not change data. (The square brackets inside tuple() below are optional, since tuple() also accepts a generator expression.) To write the result back into the dataframe, change your last line to:
data.loc[index, "translated"] = tuple([slownik.get(znak) for znak in row["1st_service"]])
and you'll get a tuple written into that one cell.
In future, posting the exact error message you're getting is very helpful!
I have managed it. Working code below:
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
data.columns = []
slownik = dict([ ])
trans = ' '
for index, row in data.iterrows():
    trans += str(tuple([slownik.get(znak) for znak in row["1st_service"]]))
data['1st_service'] = trans.split(')(')
data.to_csv("out.csv", index=False)
Can you tell me if it is done well?
Maybe there is a faster way to do it?
I am doing it for 12 columns in one for loop, as shown above.
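One possibly faster and simpler route, sketched under assumptions: Series.map applies a function to every cell of a column, so the translation can be written back column by column without iterrows() and without building and splitting a long string. The dictionary contents, the sample frame and the second column name 2nd_service are invented for this example; the real slownik and column list would replace them.

```python
import pandas as pd

# Hypothetical translation dictionary and data; the real ones come from the post.
slownik = {"A": "alpha", "B": "bravo", "C": "charlie"}
data = pd.DataFrame(
    {
        "1st_service": ["AB", "C", ""],
        "2nd_service": ["CA", "B", "AX"],
    }
)


def translate(text):
    # One tuple per cell; characters missing from slownik become None.
    return tuple(slownik.get(znak) for znak in text)


columns_to_translate = ["1st_service", "2nd_service"]  # extend to all 12 columns
for col in columns_to_translate:
    data[col] = data[col].map(translate)

print(data)
```

Each cell then holds an actual tuple rather than a fragment of a string, which should make later processing easier. Note that writing tuples to csv stores their text representation.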

Slow loop aggregating rows and columns

I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique elements like those in the list shown above.
To save overhead on appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
# consolidate specialty fields into dict-like sets (to remove redundant codes);
# output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol] # capture 1st row
        for row in range(1, df_user.shape[0]): # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis = 1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)
# construct new dataframe with no redundant UserID rows:
df_out = pd.DataFrame(out_rows,
                      columns = ['UserNbr', 'Spclty'])
# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column
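For data that still fits in memory, it may be worth trying a single pass over the rows before moving to a cluster. The slow part of the loop in the question is probably the boolean filter df_tmp['UserNbr'] == user, which rescans the whole frame once per user. The sketch below instead walks the two columns once and merges each row's pairs into a per-user dict. The sample frame is invented, and it assumes each Spclty cell is a list of [code, date] pairs as shown in the question.

```python
import pandas as pd

# Invented sample in the shape described in the question.
df_tmp = pd.DataFrame(
    {
        "UserNbr": [1, 1, 2, 3],
        "Spclty": [
            [["104", "2010-01-31"], ["215", "2014-11-21"]],
            [["215", "2014-11-21"], ["352", "2016-07-13"]],
            [],
            [["104", "2010-01-31"]],
        ],
    }
)

# One pass over the rows: each user's pairs are folded into a single dict,
# so repeated codes collapse automatically (a later row overwrites an earlier
# one if the same code appears with a different date).
merged = {}
for user, pairs in zip(df_tmp["UserNbr"], df_tmp["Spclty"]):
    merged.setdefault(user, {}).update(dict(pairs))

df_out = pd.DataFrame(
    {"UserNbr": list(merged.keys()), "Spclty": list(merged.values())}
)
print(df_out)
```

The work grows linearly with the number of input rows instead of with rows times users, which is where most of the speed-up would be expected to come from.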

Resources