Create a BOOL column based on conditions in other columns - python-3.x

I have a dataframe:
df = pd.DataFrame(np.random.randint(0,100,size=(15, 4)), columns=list('ABCD'))
I would like to create another BOOL column or YES/NO column based on the sum of column A and B > 150
I am trying a generator kind of solution:
df['Truth'] = ['Yes' for i in df.columns.values if (df.A+df.B > 150)]
I know this does not work but I keep getting another error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How do I code this and what does this error mean?

How to get a column of Boolean values:
(df.A + df.B) > 150 generates a pandas.Series of Boolean values. Assign it to a column name.
import pandas as pd
import numpy as np
# sample data
np.random.seed(2)
df = pd.DataFrame(np.random.randint(0, 100, size=(15, 4)), columns=list('ABCD'))
# create the Boolean column
df['Truth'] = (df.A + df.B) > 150
# display(df)
A B C D Truth
0 40 15 72 22 False
1 43 82 75 7 False
2 34 49 95 75 False
3 85 47 63 31 False
4 90 20 37 39 False
5 67 4 42 51 False
6 38 33 58 67 False
7 69 88 68 46 True
8 70 95 83 31 True
9 66 80 52 76 False
10 50 4 90 63 False
11 79 49 39 46 False
12 8 50 15 8 False
13 17 22 73 57 False
14 90 62 83 96 True
What does this error mean:
What is shown in the question is a list-comprehension, not a generator.
(df.A + df.B) returns a pandas.Series, which can be compared to a value like 150
The issue with the list comprehension is if (df.A+df.B > 150), which causes the ValueError because there is a series, not just a single Boolean.
Another issue is df.columns.values is just a list of the column names.
See Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all() for further details on the error.

Related

Appending DataFrame to empty DataFrame in {Key: Empty DataFrame (with columns)}

I am struggling to understand this one.
I have a regular df (same columns as the empty df in dict) and an empty df which is a value in a dictionary (the keys in the dict are variable based on certain inputs, so can be just one key/value pair or multiple key/value pairs - think this might be relevant). The dict structure is essentially:
{key: [[Empty DataFrame
Columns: [list of columns]
Index: []]]}
I am using the following code to try and add the data:
dict[key].append(df, ignore_index=True)
The error I get is:
temp_dict[product_match].append(regular_df, ignore_index=True)
TypeError: append() takes no keyword arguments
Is this error due to me mis-specifying the value I am attempting to append the df to (like am I trying to append the df to the key instead here) or something else?
Your dictionary contains a list of lists at the key, we can see this in the shown output:
{key: [[Empty DataFrame Columns: [list of columns] Index: []]]}
# ^^ list starts ^^ list ends
For this reason dict[key].append is calling list.append as mentioned by #nandoquintana.
To append to the DataFrame access the specific element in the list:
temp_dict[product_match][0][0].append(df, ignore_index=True)
Notice there is no inplace version of append. append always produces a new DataFrame:
Sample Program:
import numpy as np
import pandas as pd
temp_dict = {
'key': [[pd.DataFrame()]]
}
product_match = 'key'
np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 100, (5, 4)))
temp_dict[product_match][0][0].append(df, ignore_index=True)
print(temp_dict)
Output (temp_dict was not updated):
{'key': [[Empty DataFrame
Columns: []
Index: []]]}
The new DataFrame will need to be assigned to the correct location.
Either a new variable:
some_new_variable = temp_dict[product_match][0][0].append(df, ignore_index=True)
some_new_variable
0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65
Or back to the list:
temp_dict[product_match][0][0] = (
temp_dict[product_match][0][0].append(df, ignore_index=True)
)
temp_dict
{'key': [[ 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65]]}
Assuming there the DataFrame is actually an empty DataFrame, append is unnecessary as simply updating the value at the key to be that DataFrame works:
temp_dict[product_match] = df
temp_dict
{'key': 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65}
Or if list of list is needed:
temp_dict[product_match] = [[df]]
temp_dict
{'key': [[ 0 1 2 3
0 99 78 61 16
1 73 8 62 27
2 30 80 7 76
3 15 53 80 27
4 44 77 75 65]]}
Maybe you have an empty list at dict[key]?
Remember that "append" list method (unlike Pandas dataframe one) only receives one parameter:
https://docs.python.org/3/tutorial/datastructures.html#more-on-lists
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html

Most frequently occurring numbers across multiple columns using pandas

I have a data frame with numbers in multiple columns listed by date, what I'm trying to do is find out the most frequently occurring numbers across the whole data set, also grouped by date.
import pandas as pd
import glob
def lotnorm(pdobject) :
# clean up special characters in the column names and make the date column the index as a date type.
pdobject["Date"] = pd.to_datetime(pdobject["Date"])
pdobject = pdobject.set_index('Date')
for column in pdobject:
if '#' in column:
pdobject = pdobject.rename(columns={column:column.replace('#','')})
return pdobject
def lotimport() :
lotret = {}
# list files in data directory with csv filename
for lotpath in [f for f in glob.glob("data/*.csv")]:
lotname = lotpath.split('\\')[1].split('.')[0]
lotret[lotname] = lotnorm(pd.read_csv(lotpath))
return lotret
print(lotimport()['ozlotto'])
------------- Output ---------------------
1 2 3 4 5 6 7 8 9
Date
2020-07-07 4 5 7 9 12 13 32 19 35
2020-06-30 1 17 26 28 38 39 44 14 41
2020-06-23 1 3 9 13 17 20 41 28 45
2020-06-16 1 2 13 21 22 27 38 24 33
2020-06-09 8 11 26 27 31 38 39 3 36
... .. .. .. .. .. .. .. .. ..
2005-11-15 7 10 13 17 30 32 41 20 14
2005-11-08 12 18 22 28 33 43 45 23 13
2005-11-01 1 3 11 17 24 34 43 39 4
2005-10-25 7 16 23 29 36 39 42 19 43
2005-10-18 5 9 12 30 33 39 45 7 19
The output I am aiming for is
Number frequency
45 201
32 195
24 187
14 160
48 154
--------------- Updated with append experiment -----------
I tried using append to create a single series from the dataframe, which worked for individual lines of code but got a really odd result when I ran it inside a for loop.
temp = lotimport()['ozlotto']['1']
print(temp)
temp = temp.append(lotimport()['ozlotto']['2'], ignore_index=True, verify_integrity=True)
print(temp)
temp = temp.append(lotimport()['ozlotto']['3'], ignore_index=True, verify_integrity=True)
print(temp)
lotcomb = pd.DataFrame()
for i in (lotimport()['ozlotto'].columns.tolist()):
print(f"{i} - {type(i)}")
lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
print(lotcomb)
This solution might be the one you are looking for.
freqvalues = np.unique(df.to_numpy(), return_counts=True)
df2 = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
df2.index.name = "Numbers"
df2
Output:
Frequency
Numbers
1 6
2 5
3 5
5 8
6 4
7 7
8 2
9 7
10 3
11 4
12 2
13 8
14 1
15 4
16 4
17 6
18 4
19 5
20 9
21 3
22 4
23 2
24 4
25 5
26 4
27 6
28 1
29 6
30 3
31 3
... ...
70 6
71 6
72 5
73 5
74 2
75 8
76 5
77 3
78 3
79 2
80 3
81 4
82 6
83 9
84 5
85 4
86 1
87 3
88 4
89 3
90 4
91 4
92 3
93 5
94 1
95 4
96 6
97 6
98 1
99 6
97 rows × 1 columns
df.max(axis=0)
for columns
df.max(axis=1)
for index
Ok so the final answer I came up with was a mix of a few things including some of the great input from people in this thread. Essentially I do the following:
Pull in the CSV file and clean up the dates and the column names, then convert it to a pandas dataframe.
Then create a new pandas series and append each column to it ignoring dates to prevent conflicts.
Once I have the series, I use Vioxini's suggestion to use numpy to get counts of unique values and then turn the values into the index, after that sort the column by count in descending order and return the top 10 values.
Below is the resulting code, I hope it helps someone else.
import pandas as pd
import glob
import numpy as np
def lotnorm(pdobject) :
# clean up special characters in the column names and make the date column the index as a date type.
pdobject["Date"] = pd.to_datetime(pdobject["Date"])
pdobject = pdobject.set_index('Date')
for column in pdobject:
if '#' in column:
pdobject = pdobject.rename(columns={column:column.replace('#','')})
return pdobject
def lotimport() :
lotret = {}
# list files in data directory with csv filename
for lotpath in [f for f in glob.glob("data/*.csv")]:
lotname = lotpath.split('\\')[1].split('.')[0]
lotret[lotname] = lotnorm(pd.read_csv(lotpath))
return lotret
lotcomb = pd.Series([],dtype=object)
for i in (lotimport()['ozlotto'].columns.tolist()):
lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
freqvalues = np.unique(lotcomb.to_numpy(), return_counts=True)
lotop = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
lotop.index.name = "Numbers"
lotop.sort_values(by=['Frequency'],ascending=False).head(10)

Diagonals in North East Direction - Time Limit Exceeded in Python 3.8.3

The program must accept an integer matrix of size RxC as the input. The program must print the integers in the diagonals in the North-East directions of the matrix in the seprate line as output.
Boundary:
2<=R,C<=100
Time Limit : 500ms
Example 1:
Input:
3 3
73 77 76
71 17 87
37 73 98
Output:
73
71 77
37 17 76
73 87
98
Example 2:
Input:
4 6
97 78 7 39 92 45
68 100 49 95 97 100
59 41 81 22 26 100
46 37 81 12 93 10
Output:
97
68 78
59 100 7
46 41 49 39
37 81 95 92
81 22 97 45
12 26 100
93 100
10
My Code:
row,col = map(int,input().split())
matrix = [list(map(int,input().split())) for i in range(row)]
# Redundancy of row and col
rep = []
for i in range(row):
for j in range(col):
b = []
for k in range(i,row):
if (j,k) not in rep:
b.append(matrix[k][j])
rep.append((j,k))
j-=1
if j<0:break
if len(b):print(*(b[::-1]))
My code works well but when the matrix is of size (100,100) it exceeds the given time limit, is there a way to reduce it. Thanks in advance
Note : No External Libraries should be used!
The trick here is to realize that because each number only appears in the solution once, so we really only need to evaluate each value once.
We can also see that each matrix will result in row + col - 1 number of North-East direction diagonals, which will help us.
# Original code
row,col = map(int,input().split())
# I won't turn them into ints, strings actually make it easier for my work
matrix = [input().split() for i in range(row)]
diagonals = [""] * (row + col - 1)
for i in range(row):
for j in range(col):
# determine which diagonal the number belongs to, and prepend it
diagonals[i + j] = "%s %s" % (matrix[i][j], diagonals[i + j])
# print out diagonals one at a time
for diagonal in diagonals: print(diagonal)
I never got the chance to run it, but this should give the general idea!
(new to SO, plz be nice :D)

When using min() - ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all() [duplicate]

How can I reference the minimum value of two dataframes as part of a pandas dataframe equation? I tried using the python min() function which did not work. I'm sorry if this is well-documented somewhere but I have not been able to find a working solution for this problem. I am looking for something along the lines of this:
data['eff'] = pd.DataFrame([data['flow_h'], data['flow_c']]).min() *Cp* (data[' Thi'] - data[' Tci'])
I also tried to use pandas min() function, which is also not working.
min_flow = pd.DataFrame([data['flow_h'], data['flow_c']]).min()
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I was confused by this error. The data columns are just numbers and a name, I wasn't sure where the index comes into play.
import pandas as pd
import numpy as np
np.random.seed(365)
rows = 10
flow = {'flow_c': [np.random.randint(100) for _ in range(rows)],
'flow_d': [np.random.randint(100) for _ in range(rows)],
'flow_h': [np.random.randint(100) for _ in range(rows)]}
data = pd.DataFrame(flow)
# display(data)
flow_c flow_d flow_h
0 82 36 43
1 52 48 12
2 33 28 77
3 91 99 11
4 44 95 27
5 5 94 64
6 98 3 88
7 73 39 92
8 26 39 62
9 56 74 50
If you are trying to get the row-wise mininum of two or more columns, use pandas.DataFrame.min. Note that by default axis=0; specifying axis=1 is necessary.
data['min_c_h'] = data[['flow_h','flow_c']].min(axis=1)
# display(data)
flow_c flow_d flow_h min_c_h
0 82 36 43 43
1 52 48 12 12
2 33 28 77 33
3 91 99 11 11
4 44 95 27 27
5 5 94 64 5
6 98 3 88 88
7 73 39 92 73
8 26 39 62 26
9 56 74 50 50
If you like to get a single minimum value of multiple columns:
data[['flow_h','flow_c']].min().min()
the first "min()" calculates the minimum per column and returns a pandas series. The second "min" returns the minimum of the minimums per column.

Python: SettingWithCopyWarning when trying to set value to True based on condition

Data:
Date Stock Peak Trough Price
2002-01-01 33.78 False False 25
2002-01-02 34.19 False False 35
2002-01-03 35.44 False False 33
2002-01-04 36.75 False False 38
I use this line of code to set 'Peak' to true in each row whenever the price of a stock is higher or equal to the max value in the row starting from column 4:
df['Peak'] = np.where(df.iloc[:,4:].max(axis=1) >= df[stock], 'False', 'True')
However, I'm trying to make it so that the first X and last Y rows are not affected. Let's say X and Y are both 10 in this example. I modified it like this:
df.iloc[10:-10]['Peak'] = np.where(df.iloc[10:-10,4:].max(axis=1) >= df.iloc[10:-10][stock], 'False', 'True')
This gives me an error SettingWithCopyWarning and also doesn't work anymore. Does anyone have an idea how to get the desired result so that the first X and last Y rows are always False?
I believe you need a get_loc to specify column index when assigning using df.iloc[] :
df.iloc[10:,df.columns.get_loc('year')] = (np.where(df.iloc[10:,4:].max(axis=1)
>= df.iloc[10:,df.columns.get_loc('stock')],'False', 'True'))
To try here is a test case:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,100,(5,4)),columns=list('ABCD'))
print(df)
A B C D
0 66 92 98 17
1 83 57 86 97
2 96 47 73 32
3 46 96 25 83
4 78 36 96 80
Trying to set column D as np.nan from index 2 we get the same error:
df.iloc[2:]['D']=np.nan
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
Trying the same avoiding a chained assignment using get_loc (successful)
df.iloc[2:,df.columns.get_loc('D')] = np.nan
print(df)
A B C D
0 66 92 98 17.0
1 83 57 86 97.0
2 96 47 73 NaN
3 46 96 25 NaN
4 78 36 96 NaN

Resources