Yet another Pandas SettingWithCopyWarning question - python-3.x

Yes, this question has been asked many times! No, I have still not been able to figure out how to run this boolean filter without generating the pandas SettingWithCopyWarning.
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    df_D['count'].iloc[x] = len(df_C)  # triggers the warning
I've tried:
Copying df_A and df_B in every possible place
Using a mask
Using a query
I know I can suppress the warning, but I don't want to do that.
What am I missing? I know it's probably something obvious.
Many thanks!

For more details on why you get SettingWithCopyWarning, I would suggest you read this answer. It is mostly because selecting the column df_D['count'] and then using iloc[x] is a "chained assignment", which is what gets flagged.
To prevent it, you can get the position of the column you want in df_D and then use iloc for both the row and the column inside the for loop:
pos_col_D = df_D.columns.get_loc('count')

for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    df_D.iloc[x, pos_col_D] = len(df_C)  # no more warning
Also, because you compare all the values of df_A.age with the bounds of df_B.age_limits, I think you could improve the speed of your code using numpy.ufunc.outer, with the ufuncs being greater_equal and less_equal, and then summing over axis=0.
# Setup
import numpy as np
import pandas as pd

df_A = pd.DataFrame({'age': [12, 25, 32]})
df_B = pd.DataFrame({'age_limits': [[3, 99], [20, 45], [15, 30]]})

# Your result
for x in range(len(df_A)):
    df_C = df_A.loc[(df_A['age'] >= df_B['age_limits'].iloc[x][0]) &
                    (df_A['age'] <= df_B['age_limits'].iloc[x][1])]
    print(len(df_C))
3
2
1
# With numpy
print((np.greater_equal.outer(df_A.age, df_B.age_limits.str[0])
       & np.less_equal.outer(df_A.age, df_B.age_limits.str[1])).sum(0))
[3 2 1]
So you can assign the result of that line directly to df_D['count'], without any for loop. Hope this works for you!
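For example, a minimal sketch of that one-step assignment (assuming df_D already exists with one row per entry of df_B):
df_D['count'] = (np.greater_equal.outer(df_A.age, df_B.age_limits.str[0])
                 & np.less_equal.outer(df_A.age, df_B.age_limits.str[1])).sum(0)  # vectorized counts, no loop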

Related

Unexpected output format in pandas.groupby.apply

Does someone know why pandas behaves differently when the column we use as BY in GROUPBY contains only 1 unique value? Specifically, if there is just 1 value and we return a pandas.Series, the returned output is basically transposed compared to the multiple-unique-values case:
dt = pd.date_range('2021-01-01', '2021-01-02 23:00', closed=None, freq='1H')
df = pd.DataFrame({'date': dt.date, 'vals': range(dt.shape[0])}, index=dt)
dt1 = pd.date_range('2021-01-01', '2021-01-01 23:00', closed=None, freq='1H')
df2 = pd.DataFrame({'date': dt1.date, 'vals': range(dt1.shape[0])}, index=dt1)

def f(row):
    return row['vals']

print(df.groupby('date').apply(f).shape)
print(df2.groupby('date').apply(f).shape)
[out 1] (48,)
[out 2] (1, 24)
Is there some simple parameter I can use to make sure the behavior is consistent? Would it make sense to submit it as a bug report due to the inconsistency, or is it "expected" (I understood from a previous question that sometimes poor design or a small wart is not a bug)? (I still love pandas, it's just that these small things can make it very painful to use.)
squeeze()
DataFrame.squeeze() and Series.squeeze() can make the shapes consistent:
>>> df.groupby('date').apply(f).squeeze().shape
(48,)
>>> df2.groupby('date').apply(f).squeeze().shape
(24,)
squeeze=True (deprecated)
groupby() has a squeeze param:
squeeze: Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
>>> df.groupby('date', squeeze=True).apply(f).shape
(48,)
>>> df2.groupby('date', squeeze=True).apply(f).shape
(24,)
This has been deprecated since pandas 1.1.0 and will be removed in the future.

Using while and if function together with a condition change

I am trying to use Python to conduct a calculation which will sum the values in a column, but only for the time period during which a certain condition is met.
The summation should begin when the conditions are met (runstat == 0 and oil > 1). The summation should then stop at the point when oil == 0.
I am new to python so I am not sure how to do this.
I connected the code to a spreadsheet for testing purposes but the intent is to connect to live data. I figured a while loop in combination with an if function might work but I am not winning.
Basically I want to have the code start when runstat is zero and oil is higher than 0. It should stop summing the values of oil when the oil row becomes zero and then it should write the data to a SQL database (this I will figure out later - for now I just want to see if it can work).
This is the code I have tried so far.
import numpy as np
import pandas as pd

data = pd.read_excel('TagValues.xlsx')
df = pd.DataFrame(data)
df['oiltag'] = df['oiltag'].astype(float)
df['runstattag'] = df['runstattag'].astype(float)
oil = df['oiltag']
runstat = df['runstattag']

def startup(oil, runstat):
    while oil.all() > 0:
        if oil > 0 and runstat == 0:
            totaloil = sum(oil.all())
            print(totaloil)
        else:
            return None
    return

print(startup(oil.all(), runstat.all()))
It should sum the values in the column but it is returning: None
OK, so I think that what you want to do is get the subset of rows between the two conditions, then get a sum of those.
Method: Slice the dataframe to get the relevant rows and then sum.
import numpy as np
import pandas as pd

data = pd.read_excel('TagValues.xlsx')
df = pd.DataFrame(data)
df['oiltag'] = df['oiltag'].astype(float)
df['runstattag'] = df['runstattag'].astype(float)

def startup(dframe):
    start_row = dframe[(dframe.oiltag > 0) & (dframe.runstattag == 0)].index[0]
    end_row = dframe[(dframe.oiltag == 0) & (dframe.index > start_row)].index[0]
    subset = dframe[start_row:end_row + 1]  # +1 because the end of the slice is non-inclusive
    totaloil = subset.oiltag.sum()
    return totaloil

print(startup(df))
This code will raise an error if it can't find a subset of rows which match your criteria. If you need to handle that case, then we could add some exception handling.
EDIT: Please note this assumes that your criteria are only expected to occur once per spreadsheet. If you have multiple "chunks" that you want to sum, then this will need tweaking; one possible sketch follows below.
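For the multiple-chunk case, here is a rough sketch (assuming the same column names and a default integer index; startup_all is a hypothetical helper, not tested against your data):
def startup_all(dframe):
    totals = []
    start_row = None
    for idx, row in dframe.iterrows():
        if start_row is None and row.oiltag > 0 and row.runstattag == 0:
            start_row = idx  # a qualifying run begins here
        elif start_row is not None and row.oiltag == 0:
            # run ended: sum oil from the start row up to (and including) this row
            totals.append(dframe.loc[start_row:idx, 'oiltag'].sum())
            start_row = None  # reset and look for the next run
    return totals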

How to save tuples output from for loop to DataFrame Python

I have some data: 33k rows x 57 columns.
In some columns there is data which I want to translate with a dictionary.
I have done the translation, but now I want to write the translated data back to my data set.
I have a problem with saving the tuples output from the for loop.
I am using tuples to create a good translation. .join and .append are not working in my case. I have tried many things but without any success.
Looking for any advice.
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
for index, row in data.iterrows():
    row["translated"] = tuple(slownik.get(znak) for znak in row["1st_service"])
I just want print(data["1st_service"]) to show the translated data, not the data from before the for loop.
First of all, if your csv doesn't already have a 'translated' column, you'll have to add it:
import numpy as np
data['translated'] = np.nan
The problem is that the row object you're trying to write to is only a view of the dataframe, not the dataframe itself. Plus, you're missing square brackets for your list comprehension, if I'm understanding what you're doing. So change your last line to:
data.loc[index, "translated"] = tuple([slownik.get(znak) for znak in row["1st_service"]])
and you'll get a tuple written into that one cell.
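One caveat (an assumption on my part, depending on your pandas version): holding tuples in cells requires the column to have object dtype, and the np.nan initialization creates a float column. If the assignment complains, converting the column first should help:
data['translated'] = np.nan
data['translated'] = data['translated'].astype(object)  # allow tuples in cells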
In future, posting the exact error message you're getting is very helpful!
I have managed it; below is the working code:
data = pd.read_csv(filepath, engine="python", sep=";", keep_default_na=False)
data.columns = []
slownik = dict([ ])
trans = ' '
for index, row in data.iterrows():
    trans += str(tuple([slownik.get(znak) for znak in row["1st_service"]]))
data['1st_service'] = trans.split(')(')
data.to_csv("out.csv", index=False)
Can you tell me if it is well done?
Maybe there is a faster way to do it?
I am doing it for 12 columns in one for loop, as shown above.
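A likely faster alternative (just a sketch, assuming slownik maps single characters and each cell of the column is an iterable of characters) is to translate whole columns with apply instead of iterrows:
cols = ['1st_service']  # hypothetical list; extend with your other 11 column names
for col in cols:
    data[col] = data[col].apply(
        lambda cell: tuple(slownik.get(znak) for znak in cell)
    )  # one pass per column, no iterrows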

Boolean, Flatnonzero, Selecting a certain range in numpy in python

I have a data file in .txt that consists of 2 columns. The first one is my x values and the second column contains my y values.
What I am trying to do is quite simple. I want to identify where my x values are >= 1700 and <= 1735 so that I can get the respective y values within that x range. At the end I want to get the sum of those y values.
The following is the code I wrote.
import numpy as np
data = np.loadtxt('NI2_2.txt')
x_all= data[:,0]
y_all= data[:,1]
x_selected= np.flatnonzero(np.logical_and(x_all<=1700),(x_all=>1735))
y_selected= y_all[x_selected]
y_final= np.sum(y_selected)
I get an error message for my x_selected, saying that the syntax is not correct. Does someone see what is wrong with it?
Thanks!
Cece
Try using np.where:
y_selected = y_all[np.where((x_all >= 1700) & (x_all <= 1735))]
y_final = np.sum(y_selected)
EDIT:
Also you cannot write => in python. Use >=.
It may be only because the comparison operator is >= and not =>, but I can't test any further, sorry.
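As a side note (an observation beyond the original answers): the np.where is not strictly needed, since a boolean mask can index the array directly, and the original line also closes the np.logical_and parenthesis too early, leaving the second condition outside the call. A small sketch with the same variable names:
mask = (x_all >= 1700) & (x_all <= 1735)  # elementwise AND of both conditions
y_selected = y_all[mask]
y_final = y_selected.sum()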

Update Pandas DF during while loop (Python3, Pandas)

Some background: My code takes user input and applies it to my DF to remove certain rows. This process repeats as many times as the user would like. Unfortunately, I am not sure how to update my DF within the while loop I have created so that it keeps the changes being made:
data = {'hello': ['the man', 'is a', 'good guy']}
df = pd.DataFrame(data)

def func():
    while True:
        n = input('Words: ')
        if n == "Done":
            break
        elif n != "Done":
            pattern = '^' + ''.join('(?=.*{})'.format(word) for word in n.split())
            df[df['hello'].str.contains(pattern) == False]
How do I update the DF at the end of each loop so the changes being made stay put?
Ok, I reevaluated your problem and my old answer was totally wrong of course.
What you want is the DataFrame.drop method. This can be done inplace.
mask = df['hello'].str.contains(pattern)
df.drop(df.index[mask], inplace=True)  # drop takes labels, so select the matching index values
This will update your DataFrame.
Looks to me like you've already done all the hard work, but there are two problems.
Your last line doesn't store the result anywhere. Most Pandas operations are not "in-place", which means you have to store the result somewhere to be able to use it later.
df is a global variable, and setting its value inside a function doesn't work, unless you explicitly have a line stating global df. See the good answers to this question for more detail.
So I think you just need to do:
df = df[df['hello'].str.contains(pattern)==False]
to fix problem one.
For problem two, at the end of func do return df, and then when you call func, call it like:
df = func(df)
OR, start func with the line
global df
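Putting both fixes together, a minimal sketch of the parameter-passing version (one way to do it, not the only one):
def func(df):
    while True:
        n = input('Words: ')
        if n == "Done":
            return df  # hand the filtered frame back to the caller
        pattern = '^' + ''.join('(?=.*{})'.format(word) for word in n.split())
        df = df[df['hello'].str.contains(pattern) == False]  # store the result

df = func(df)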
