Python: SettingWithCopyWarning when trying to set value to True based on condition

Data:
Date Stock Peak Trough Price
2002-01-01 33.78 False False 25
2002-01-02 34.19 False False 35
2002-01-03 35.44 False False 33
2002-01-04 36.75 False False 38
I use this line of code to set 'Peak' to True in each row whenever the price of a stock is higher than or equal to the max value in the row, starting from column 4:
df['Peak'] = np.where(df.iloc[:,4:].max(axis=1) >= df[stock], 'False', 'True')
However, I'm trying to make it so that the first X and last Y rows are not affected. Let's say X and Y are both 10 in this example. I modified it like this:
df.iloc[10:-10]['Peak'] = np.where(df.iloc[10:-10,4:].max(axis=1) >= df.iloc[10:-10][stock], 'False', 'True')
This raises a SettingWithCopyWarning and no longer works. Does anyone have an idea how to get the desired result so that the first X and last Y rows are always False?

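I believe you need get_loc to specify the column position when assigning through df.iloc[], so that rows and column are selected in a single indexing call (stock is the column-name variable from the question):
df.iloc[10:-10, df.columns.get_loc('Peak')] = np.where(
    df.iloc[10:-10, 4:].max(axis=1) >= df.iloc[10:-10, df.columns.get_loc(stock)],
    'False', 'True')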
Here is a test case to try:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,100,(5,4)),columns=list('ABCD'))
print(df)
A B C D
0 66 92 98 17
1 83 57 86 97
2 96 47 73 32
3 46 96 25 83
4 78 36 96 80
Trying to set column D to np.nan from index 2 onward with chained indexing produces the same warning:
df.iloc[2:]['D']=np.nan
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Doing the same without chained assignment, using get_loc, succeeds:
df.iloc[2:,df.columns.get_loc('D')] = np.nan
print(df)
A B C D
0 66 92 98 17.0
1 83 57 86 97.0
2 96 47 73 NaN
3 46 96 25 NaN
4 78 36 96 NaN
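Applied to the original problem, the same idea can be sketched as follows. This is only an illustration under assumptions: the column names P1, P2, P3, the sample values, and the variables x and y are made up, and the result is stored as real Booleans rather than the strings 'True'/'False'.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a price column and three columns to compare against
df = pd.DataFrame({
    "Stock": [50, 10, 90, 20, 95, 30, 99, 5],
    "P1": [10, 20, 30, 40, 50, 60, 70, 80],
    "P2": [5, 15, 25, 35, 45, 55, 65, 75],
    "P3": [1, 2, 3, 4, 5, 6, 7, 8],
})
df["Peak"] = False

x, y = 2, 2  # leading / trailing rows that must stay False

# Row-wise condition, computed for every row
is_peak = df["Stock"] >= df[["P1", "P2", "P3"]].max(axis=1)

# Positional mask that is True only for rows x .. len(df) - y - 1
inner = np.zeros(len(df), dtype=bool)
inner[x:len(df) - y] = True

# One .loc call selects rows and column together, so nothing is set on a copy
df.loc[inner, "Peak"] = is_peak[inner]
print(df)
```

Rows 0 and 6 satisfy the condition but remain False because they fall in the protected ranges.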

Related

Appending DataFrame to empty DataFrame in {Key: Empty DataFrame (with columns)}

I am struggling to understand this one.
I have a regular df (same columns as the empty df in dict) and an empty df which is a value in a dictionary (the keys in the dict are variable based on certain inputs, so can be just one key/value pair or multiple key/value pairs - think this might be relevant). The dict structure is essentially:
{key: [[Empty DataFrame
Columns: [list of columns]
Index: []]]}
I am using the following code to try and add the data:
dict[key].append(df, ignore_index=True)
The error I get is:
temp_dict[product_match].append(regular_df, ignore_index=True)
TypeError: append() takes no keyword arguments
Is this error due to me mis-specifying the value I am attempting to append the df to (like am I trying to append the df to the key instead here) or something else?
Your dictionary contains a list of lists at the key, as we can see in the output shown:
{key: [[Empty DataFrame Columns: [list of columns] Index: []]]}
# ^^ list starts ^^ list ends
For this reason dict[key].append is calling list.append, as mentioned by @nandoquintana.
To append to the DataFrame access the specific element in the list:
temp_dict[product_match][0][0].append(df, ignore_index=True)
Note that DataFrame.append has no in-place version; it always returns a new DataFrame:
Sample Program:
import numpy as np
import pandas as pd
temp_dict = {
'key': [[pd.DataFrame()]]
}
product_match = 'key'
np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 100, (5, 4)))
temp_dict[product_match][0][0].append(df, ignore_index=True)
print(temp_dict)
Output (temp_dict was not updated):
{'key': [[Empty DataFrame
Columns: []
Index: []]]}
The new DataFrame will need to be assigned to the correct location.
Either a new variable:
some_new_variable = temp_dict[product_match][0][0].append(df, ignore_index=True)
some_new_variable
0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65
Or back to the list:
temp_dict[product_match][0][0] = (
temp_dict[product_match][0][0].append(df, ignore_index=True)
)
temp_dict
{'key': [[ 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65]]}
Assuming the stored DataFrame is actually empty, append is unnecessary, as simply updating the value at the key to be that DataFrame works:
temp_dict[product_match] = df
temp_dict
{'key': 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65}
Or, if the list of lists is needed:
temp_dict[product_match] = [[df]]
temp_dict
{'key': [[ 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65]]}
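One caveat for current pandas versions: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the calls above raise an AttributeError there. A minimal sketch of the same assignment with pd.concat, reusing the names from the sample program:

```python
import numpy as np
import pandas as pd

temp_dict = {"key": [[pd.DataFrame()]]}
product_match = "key"

np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 100, (5, 4)))

# pd.concat also returns a new DataFrame, so store the result back in the list
temp_dict[product_match][0][0] = pd.concat(
    [temp_dict[product_match][0][0], df], ignore_index=True
)
print(temp_dict)
```

As with append, nothing is modified in place; the result has to be assigned to wherever it should live.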
Maybe you have an empty list at dict[key]?
Remember that the list "append" method (unlike the pandas DataFrame one) accepts only a single positional argument and no keyword arguments:
https://docs.python.org/3/tutorial/datastructures.html#more-on-lists
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html

Create a BOOL column based on conditions in other columns

I have a dataframe:
df = pd.DataFrame(np.random.randint(0,100,size=(15, 4)), columns=list('ABCD'))
I would like to create another BOOL column or YES/NO column based on whether the sum of columns A and B is greater than 150.
I am trying a generator kind of solution:
df['Truth'] = ['Yes' for i in df.columns.values if (df.A+df.B > 150)]
I know this does not work but I keep getting another error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How do I code this and what does this error mean?
How to get a column of Boolean values:
(df.A + df.B) > 150 generates a pandas.Series of Boolean values. Assign it to a column name.
import pandas as pd
import numpy as np
# sample data
np.random.seed(2)
df = pd.DataFrame(np.random.randint(0, 100, size=(15, 4)), columns=list('ABCD'))
# create the Boolean column
df['Truth'] = (df.A + df.B) > 150
# display(df)
A B C D Truth
0 40 15 72 22 False
1 43 82 75 7 False
2 34 49 95 75 False
3 85 47 63 31 False
4 90 20 37 39 False
5 67 4 42 51 False
6 38 33 58 67 False
7 69 88 68 46 True
8 70 95 83 31 True
9 66 80 52 76 False
10 50 4 90 63 False
11 79 49 39 46 False
12 8 50 15 8 False
13 17 22 73 57 False
14 90 62 83 96 True
What does this error mean:
What is shown in the question is a list-comprehension, not a generator.
(df.A + df.B) returns a pandas.Series, which can be compared to a value like 150
The issue with the list comprehension is if (df.A+df.B > 150), which causes the ValueError because there is a series, not just a single Boolean.
Another issue is that df.columns.values is just an array of the column names, so the comprehension iterates over columns, not rows.
See "Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()" for further details on the error.
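For the YES/NO variant mentioned in the question, one option is np.where, which maps the Boolean Series to labels element by element. The sketch below uses a small made-up frame and also reproduces the ValueError, to show where it comes from:

```python
import numpy as np
import pandas as pd

# Small hypothetical sample
df = pd.DataFrame({"A": [40, 69, 70, 90], "B": [15, 88, 95, 62]})

mask = (df["A"] + df["B"]) > 150  # Boolean Series, one value per row

# An `if` needs a single True/False, so a whole Series raises the ValueError
try:
    if mask:
        pass
except ValueError as err:
    message = str(err)
    print(message)

# Element-wise mapping of the Boolean Series to labels
df["Truth"] = np.where(mask, "Yes", "No")
print(df)
```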

When using min() - ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all() [duplicate]

How can I reference the minimum value of two dataframes as part of a pandas dataframe equation? I tried using the python min() function which did not work. I'm sorry if this is well-documented somewhere but I have not been able to find a working solution for this problem. I am looking for something along the lines of this:
data['eff'] = pd.DataFrame([data['flow_h'], data['flow_c']]).min() *Cp* (data[' Thi'] - data[' Tci'])
I also tried to use pandas min() function, which is also not working.
min_flow = pd.DataFrame([data['flow_h'], data['flow_c']]).min()
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I was confused by this error. The data columns are just numbers and a name, I wasn't sure where the index comes into play.
import pandas as pd
import numpy as np
np.random.seed(365)
rows = 10
flow = {'flow_c': [np.random.randint(100) for _ in range(rows)],
'flow_d': [np.random.randint(100) for _ in range(rows)],
'flow_h': [np.random.randint(100) for _ in range(rows)]}
data = pd.DataFrame(flow)
# display(data)
flow_c flow_d flow_h
0 82 36 43
1 52 48 12
2 33 28 77
3 91 99 11
4 44 95 27
5 5 94 64
6 98 3 88
7 73 39 92
8 26 39 62
9 56 74 50
If you are trying to get the row-wise minimum of two or more columns, use pandas.DataFrame.min. Note that the default is axis=0 (one minimum per column), so axis=1 must be specified.
data['min_c_h'] = data[['flow_h','flow_c']].min(axis=1)
# display(data)
flow_c flow_d flow_h min_c_h
0 82 36 43 43
1 52 48 12 12
2 33 28 77 33
3 91 99 11 11
4 44 95 27 27
5 5 94 64 5
6 98 3 88 88
7 73 39 92 73
8 26 39 62 26
9 56 74 50 50
If you would like to get a single minimum value across multiple columns:
data[['flow_h','flow_c']].min().min()
The first min() calculates the minimum per column and returns a pandas Series. The second min() returns the minimum of those per-column minimums.
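Tying this back to the formula in the question: the built-in min() compares the two Series with < and then needs a single Boolean, which is what raises the ValueError in the title. A row-wise minimum avoids that. The sketch below is an assumption-laden illustration: the values, the Cp constant, and the column names Thi and Tci (written without the leading space used in the question) are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the flow_* column names come from the question
data = pd.DataFrame({
    "flow_h": [43, 12, 77],
    "flow_c": [82, 52, 33],
    "Thi": [90.0, 85.0, 80.0],
    "Tci": [20.0, 25.0, 30.0],
})
Cp = 4.18  # assumed constant

# Row-wise minimum of the two flow columns
min_flow = data[["flow_h", "flow_c"]].min(axis=1)

# np.minimum gives the same element-wise result for exactly two columns
same = np.minimum(data["flow_h"], data["flow_c"])

data["eff"] = min_flow * Cp * (data["Thi"] - data["Tci"])
print(data)
```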

Remove index from dataframe using Python

I am trying to create a Pandas Dataframe from a string using the following code -
import pandas as pd
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)
I am getting the following result -
0 1 2
0 A B C
1 0 34 88
2 2 45 200
3 3 47 65
4 4 32 140
5 None None
But I need something like the following -
A B C
0 34 88
2 45 200
3 47 65
4 32 140
I added "index = False" while creating the dataframe like -
df = pd.DataFrame([x.split(';') for x in data.split('\n')],index = False)
But, it gives me an error -
TypeError: Index(...) must be called with a collection of some kind, False
was passed
How is this achievable?
Use read_csv with StringIO and the index_col parameter to set the first column as the index (pd.compat.StringIO has been removed from pandas, so import StringIO from io):
from io import StringIO
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
df = pd.read_csv(StringIO(input_string), sep=';', index_col=0)
print (df)
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
Your solution can be fixed by calling split with its default parameter (arbitrary whitespace), passing all the lists except the first to DataFrame, passing the first list as the columns parameter, and, if the first column should become the index, adding DataFrame.set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index('A')
print (df)
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
For a general solution, use the first value of the first list in set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index(L[0][0])
EDIT:
You can move the name A from the index name to the columns name, so that it is printed on the header row:
df = df.rename_axis(df.index.name, axis=1).rename_axis(None)
print (df)
A B C
0 34 88
2 45 200
3 47 65
4 32 140
import pandas as pd
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
df = pd.DataFrame([x.split(';') for x in data.split()])
df.columns = df.iloc[0]
df = df.iloc[1:].rename_axis(None, axis=1)
df.set_index('A',inplace = True)
df
output
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
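If the goal is only to display the table without the default integer index, rather than to promote column A to the index, another option is DataFrame.to_string(index=False). A minimal sketch using the same input string:

```python
from io import StringIO

import pandas as pd

input_string = """A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""

# Parse the semicolon-separated text; A stays an ordinary column
df = pd.read_csv(StringIO(input_string), sep=";")

# Render the frame as text without the index column
text = df.to_string(index=False)
print(text)
```

The DataFrame itself still has its RangeIndex; only the printed representation omits it.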

Pandas: how to test that top-n-dataframe really results from original dataframe

I have a DataFrame, foo:
A B C D E
0 50 46 18 65 55
1 48 56 98 71 96
2 99 48 36 79 70
3 15 24 25 67 34
4 77 67 98 22 78
and another Dataframe, bar, which contains the greatest 2 values of each row of foo. All other values have been replaced with zeros, to create sparsity:
A B C D E
0 0 0 0 65 55
1 0 0 98 0 96
2 99 0 0 79 0
3 0 0 0 67 34
4 0 0 98 0 78
How can I test that every row in bar really contains the desired values?
One more thing: The solution should work with large DataFrames, e.g. 20000 x 20000.
Obviously you can do that with looping and efficient sorting, but maybe a better way would be:
import numpy as np
n = foo.shape[0]
# Test 1:
# bar equals foo in exactly two positions per row (zeros everywhere else)
diff = foo - bar
test1 = ((diff == 0).sum(axis=1) == 2).sum() == n
# Test 2:
# bar has exactly 3 zeros on each line
test2 = ((bar == 0).sum(axis=1) == 3).sum() == n
# Test 3:
# the 2 numbers that bar keeps are the row maxima
# replace zeros with NaN so that min() ignores them
bar2 = bar.replace(0, np.nan)
# the max of the remaining values is smaller than the min of the kept values
row_ok = diff.max(axis=1) < bar2.min(axis=1)
test3 = row_ok.sum() == n
I think this covers all cases, but I haven't tested it all...
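An alternative is to rebuild the expected top-n frame from foo and compare it with bar in one step. The sketch below is not the answerer's method; it uses np.partition and assumes that foo contains only positive values and that there are no ties at the n-th largest value of a row. The helper name is_top_n is made up for illustration.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame(
    [[50, 46, 18, 65, 55],
     [48, 56, 98, 71, 96],
     [99, 48, 36, 79, 70],
     [15, 24, 25, 67, 34],
     [77, 67, 98, 22, 78]],
    columns=list("ABCDE"),
)
bar = pd.DataFrame(
    [[0, 0, 0, 65, 55],
     [0, 0, 98, 0, 96],
     [99, 0, 0, 79, 0],
     [0, 0, 0, 67, 34],
     [0, 0, 98, 0, 78]],
    columns=list("ABCDE"),
)


def is_top_n(foo, bar, n=2):
    """Return True if bar keeps exactly the n largest values of each row of foo."""
    a = foo.to_numpy()
    b = bar.to_numpy()
    # After partitioning, position -n of each row holds the n-th largest value
    threshold = np.partition(a, -n, axis=1)[:, -n][:, None]
    # Keep values at or above the row threshold, zero out the rest
    expected = np.where(a >= threshold, a, 0)
    return np.array_equal(expected, b)


print(is_top_n(foo, bar))
```

np.partition avoids a full sort of each row, but it does allocate a copy of the array, which matters at 20000 x 20000.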
