When using min() - ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

How can I reference the minimum value of two dataframes as part of a pandas dataframe equation? I tried using the python min() function which did not work. I'm sorry if this is well-documented somewhere but I have not been able to find a working solution for this problem. I am looking for something along the lines of this:
data['eff'] = pd.DataFrame([data['flow_h'], data['flow_c']]).min() * Cp * (data[' Thi'] - data[' Tci'])
I also tried to use pandas min() function, which is also not working.
min_flow = pd.DataFrame([data['flow_h'], data['flow_c']]).min()
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
I was confused by this error. The data columns are just numbers and a name, I wasn't sure where the index comes into play.
import pandas as pd
import numpy as np
np.random.seed(365)
rows = 10
flow = {'flow_c': [np.random.randint(100) for _ in range(rows)],
'flow_d': [np.random.randint(100) for _ in range(rows)],
'flow_h': [np.random.randint(100) for _ in range(rows)]}
data = pd.DataFrame(flow)
# display(data)
flow_c flow_d flow_h
0 82 36 43
1 52 48 12
2 33 28 77
3 91 99 11
4 44 95 27
5 5 94 64
6 98 3 88
7 73 39 92
8 26 39 62
9 56 74 50

If you are trying to get the row-wise minimum of two or more columns, use pandas.DataFrame.min with axis=1. The default is axis=0, which returns one minimum per column instead.
data['min_c_h'] = data[['flow_h','flow_c']].min(axis=1)
# display(data)
flow_c flow_d flow_h min_c_h
0 82 36 43 43
1 52 48 12 12
2 33 28 77 33
3 91 99 11 11
4 44 95 27 27
5 5 94 64 5
6 98 3 88 88
7 73 39 92 73
8 26 39 62 26
9 56 74 50 50

If you would like to get a single minimum value across multiple columns:
data[['flow_h','flow_c']].min().min()
The first min() calculates the minimum of each column and returns a pandas Series. The second min() returns the minimum of those per-column minimums.
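For the formula in the question, here is a minimal sketch under stated assumptions: the value of Cp and the temperature data are made up, and the columns are named Thi and Tci without the leading space. It shows why the built-in min() fails (it compares the two Series and then needs a single True or False) and uses numpy.minimum as one element-wise alternative to DataFrame.min(axis=1).

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only flow_h and flow_c mirror the question's columns.
data = pd.DataFrame({
    "flow_h": [43, 12, 77],
    "flow_c": [82, 52, 33],
    "Thi": [90.0, 85.0, 80.0],
    "Tci": [20.0, 25.0, 30.0],
})
Cp = 4.18  # assumed constant, not taken from the question

# The built-in min() evaluates `flow_c < flow_h`, gets a Boolean Series,
# and then tries to reduce it to one truth value, which raises ValueError.
try:
    min(data["flow_h"], data["flow_c"])
except ValueError as exc:
    builtin_error = str(exc)

# Element-wise minimum of the two columns, one value per row.
min_flow = np.minimum(data["flow_h"], data["flow_c"])

data["eff"] = min_flow * Cp * (data["Thi"] - data["Tci"])
```

data[['flow_h', 'flow_c']].min(axis=1) gives the same per-row values as np.minimum here; either can be used inside the equation.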

Related

Python: How to find nth minimum value from a dataframe column?

Have got a dataframe like below:
Store Row_no
11 56
11 57
11 58
12 89
12 90
12 91
12 92
For each store, I need to get the 3rd smallest value of Row_no. Expected output below.
Store Row_no
11 58
12 91
I have tried df.Row_no.nsmallest(3), but it returns the three smallest values of the whole column rather than one value per store. Any help will be appreciated. Thank you!
Use DataFrame.sort_values with GroupBy.nth:
df = df.sort_values(['Store','Row_no']).groupby('Store', as_index=False).nth(2)
print (df)
Store Row_no
2 11 58
5 12 91
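A self-contained version of this approach, using the sample data from the question, might look like the sketch below. Note that nth is zero-based, so nth(2) picks the third row of each group after sorting; a store with fewer than three rows simply does not appear in the result.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Store": [11, 11, 11, 12, 12, 12, 12],
    "Row_no": [56, 57, 58, 89, 90, 91, 92],
})

# Sort so that position 2 within each store is the 3rd smallest Row_no,
# then take that position from every group.
third = (
    df.sort_values(["Store", "Row_no"])
      .groupby("Store", as_index=False)
      .nth(2)
)
print(third)
```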

Create a BOOL column based on conditions in other columns

I have a dataframe:
df = pd.DataFrame(np.random.randint(0,100,size=(15, 4)), columns=list('ABCD'))
I would like to create another BOOL column or YES/NO column based on whether the sum of columns A and B is greater than 150.
I am trying a generator kind of solution:
df['Truth'] = ['Yes' for i in df.columns.values if (df.A+df.B > 150)]
I know this does not work but I keep getting another error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How do I code this and what does this error mean?
How to get a column of Boolean values:
(df.A + df.B) > 150 generates a pandas.Series of Boolean values. Assign it to a column name.
import pandas as pd
import numpy as np
# sample data
np.random.seed(2)
df = pd.DataFrame(np.random.randint(0, 100, size=(15, 4)), columns=list('ABCD'))
# create the Boolean column
df['Truth'] = (df.A + df.B) > 150
# display(df)
A B C D Truth
0 40 15 72 22 False
1 43 82 75 7 False
2 34 49 95 75 False
3 85 47 63 31 False
4 90 20 37 39 False
5 67 4 42 51 False
6 38 33 58 67 False
7 69 88 68 46 True
8 70 95 83 31 True
9 66 80 52 76 False
10 50 4 90 63 False
11 79 49 39 46 False
12 8 50 15 8 False
13 17 22 73 57 False
14 90 62 83 96 True
What does this error mean:
What is shown in the question is a list-comprehension, not a generator.
(df.A + df.B) returns a pandas.Series, which can be compared to a value like 150
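The issue with the list comprehension is the condition if (df.A+df.B > 150): the comparison produces a whole Series of Booleans, while if needs a single True or False, so pandas raises the ValueError.
Another issue is that df.columns.values is just an array of the column names, so the loop iterates over the four column labels rather than over the rows.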
See Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all() for further details on the error.
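Since the question also mentions a YES/NO column, here is a small sketch with made-up data. It reproduces the error with a plain if and then maps the Boolean mask to strings with numpy.where; the column values and the names mask and error_message are illustrative only.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are made up.
df = pd.DataFrame({"A": [90, 10, 80], "B": [70, 20, 60]})

# One Boolean per row.
mask = (df["A"] + df["B"]) > 150

# Using the whole Series as an `if` condition triggers the ValueError.
try:
    if mask:
        pass
except ValueError as exc:
    error_message = str(exc)

# Vectorised Yes/No column built from the Boolean mask.
df["Truth"] = np.where(mask, "Yes", "No")
```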

Most frequently occurring numbers across multiple columns using pandas

I have a data frame with numbers in multiple columns listed by date. What I'm trying to do is find the most frequently occurring numbers across the whole data set, also grouped by date.
import pandas as pd
import glob
def lotnorm(pdobject) :
    # clean up special characters in the column names and make the date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column:column.replace('#','')})
    return pdobject
def lotimport() :
    lotret = {}
    # list files in data directory with csv filename
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret
print(lotimport()['ozlotto'])
------------- Output ---------------------
1 2 3 4 5 6 7 8 9
Date
2020-07-07 4 5 7 9 12 13 32 19 35
2020-06-30 1 17 26 28 38 39 44 14 41
2020-06-23 1 3 9 13 17 20 41 28 45
2020-06-16 1 2 13 21 22 27 38 24 33
2020-06-09 8 11 26 27 31 38 39 3 36
... .. .. .. .. .. .. .. .. ..
2005-11-15 7 10 13 17 30 32 41 20 14
2005-11-08 12 18 22 28 33 43 45 23 13
2005-11-01 1 3 11 17 24 34 43 39 4
2005-10-25 7 16 23 29 36 39 42 19 43
2005-10-18 5 9 12 30 33 39 45 7 19
The output I am aiming for is
Number frequency
45 201
32 195
24 187
14 160
48 154
--------------- Updated with append experiment -----------
I tried using append to create a single series from the dataframe, which worked for individual lines of code but got a really odd result when I ran it inside a for loop.
temp = lotimport()['ozlotto']['1']
print(temp)
temp = temp.append(lotimport()['ozlotto']['2'], ignore_index=True, verify_integrity=True)
print(temp)
temp = temp.append(lotimport()['ozlotto']['3'], ignore_index=True, verify_integrity=True)
print(temp)
lotcomb = pd.DataFrame()
for i in (lotimport()['ozlotto'].columns.tolist()):
    print(f"{i} - {type(i)}")
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
print(lotcomb)
This solution might be the one you are looking for.
freqvalues = np.unique(df.to_numpy(), return_counts=True)
df2 = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
df2.index.name = "Numbers"
df2
Output:
Frequency
Numbers
1 6
2 5
3 5
5 8
6 4
7 7
8 2
9 7
10 3
11 4
12 2
13 8
14 1
15 4
16 4
17 6
18 4
19 5
20 9
21 3
22 4
23 2
24 4
25 5
26 4
27 6
28 1
29 6
30 3
31 3
... ...
70 6
71 6
72 5
73 5
74 2
75 8
76 5
77 3
78 3
79 2
80 3
81 4
82 6
83 9
84 5
85 4
86 1
87 3
88 4
89 3
90 4
91 4
92 3
93 5
94 1
95 4
96 6
97 6
98 1
99 6
97 rows × 1 columns
df.max(axis=0) returns the maximum of each column.
df.max(axis=1) returns the maximum of each row (one value per index label).
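A tiny sketch of the two axis settings, on made-up data, in case the difference is unclear:

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"A": [1, 9, 3], "B": [7, 2, 8]})

col_max = df.max(axis=0)  # one value per column: A -> 9, B -> 8
row_max = df.max(axis=1)  # one value per row: 7, 9, 8
```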
Ok so the final answer I came up with was a mix of a few things including some of the great input from people in this thread. Essentially I do the following:
Pull in the CSV file and clean up the dates and the column names, then convert it to a pandas dataframe.
Then create a new pandas series and append each column to it ignoring dates to prevent conflicts.
Once I have the series, I use Vioxini's suggestion to use numpy to get counts of unique values and then turn the values into the index, after that sort the column by count in descending order and return the top 10 values.
Below is the resulting code, I hope it helps someone else.
import pandas as pd
import glob
import numpy as np
def lotnorm(pdobject) :
    # clean up special characters in the column names and make the date column the index as a date type.
    pdobject["Date"] = pd.to_datetime(pdobject["Date"])
    pdobject = pdobject.set_index('Date')
    for column in pdobject:
        if '#' in column:
            pdobject = pdobject.rename(columns={column:column.replace('#','')})
    return pdobject
def lotimport() :
    lotret = {}
    # list files in data directory with csv filename
    for lotpath in [f for f in glob.glob("data/*.csv")]:
        lotname = lotpath.split('\\')[1].split('.')[0]
        lotret[lotname] = lotnorm(pd.read_csv(lotpath))
    return lotret
lotcomb = pd.Series([],dtype=object)
for i in (lotimport()['ozlotto'].columns.tolist()):
    lotcomb = lotcomb.append(lotimport()['ozlotto'][i], ignore_index=True, verify_integrity=True)
freqvalues = np.unique(lotcomb.to_numpy(), return_counts=True)
lotop = pd.DataFrame(index=freqvalues[0], data=freqvalues[1], columns=["Frequency"])
lotop.index.name = "Numbers"
lotop.sort_values(by=['Frequency'],ascending=False).head(10)
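Series.append was removed in pandas 2.0, so the loop above only runs on older versions. As a hedged alternative (not the code the poster used), the same counts can be sketched with DataFrame.stack and Series.value_counts. The draw data below is made up and stands in for the cleaned 'ozlotto' frame.

```python
import pandas as pd

# Made-up stand-in for the cleaned lottery frame (dates as index, one column per ball).
draws = pd.DataFrame(
    {"1": [4, 1, 1], "2": [5, 17, 3], "3": [7, 26, 4]},
    index=pd.to_datetime(["2020-07-07", "2020-06-30", "2020-06-23"]),
)

# stack() flattens all columns into one Series; value_counts() counts each
# number and sorts by frequency, highest first.
counts = draws.stack().value_counts()

# Same layout as the answer above: numbers as index, one Frequency column.
top = counts.rename_axis("Numbers").to_frame("Frequency").head(10)
print(top)
```

This avoids re-reading the CSV files on every loop iteration, which the repeated lotimport() calls in the original loop do.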

Plot individuals home-range with Adehabitat

I am trying to label each polygon in a home-range plot with the name of the individual it belongs to, but after many attempts I still cannot achieve it.
Here is an example of my data: X and Y are coordinates and id identifies the individual.
X Y id
29 29 4
44 28 7
57 57 5
60 81 11
32 41 4
43 29 7
57 57 5
46 83 11
32 41 4
43 29 7
57 56 5
60 82 11
35 40 4
43 28 7
62 55 5
54 73 11
27 40 4
43 28 7
61 54 5
First, I calculated the home range of my data with MCP (minimum convex polygon):
cp <- mcp((data)[,1],percent=95, unin = c("m"), unout = c("m2"))
And then plotted it:
plot(cp, axes=TRUE, border = rainbow(12))
But I don't know which polygon corresponds to which individual, and if possible I need to include the id of each individual inside its polygon.
Any help would be appreciated!!
Thanks
Juan
Here is an example using the example data from the adehabitatHR package, since you are not really providing a reproducible example.
library(adehabitatHR)
data("puechabonsp")
cp <- mcp(puechabonsp$relocs[, 1], percent=95, unin = c("m"), unout = c("m2"))
One way would be to use ggplot2 and sf:
library(sf)
library(tidyverse)
st_as_sf(cp) %>% ggplot(., aes(fill = id)) + geom_sf(alpha = 0.5) +
scale_fill_discrete(name = "Animal id")

MS Excel: how can I make Max() more efficient?

I have a set of data that looks like this:
ID Value MaxByID
0 32 80
0 80 80
0 4 80
0 68 80
0 6 80
1 32 68
1 54 68
1 56 68
1 68 68
1 44 68
2 54 92
2 52 92
2 92 92
4 68 68
4 52 68
5 74 74
5 22 74
6 52 94
6 52 94
6 46 94
6 94 94
6 56 94
6 14 94
I am using {=MAX(IF(A$2:A$100=A2,B$2:B$100))} to calculate the MaxByID column. However, the dataset has >100k rows, with mostly unique IDs: this seems to be a really inefficient way to do this, as each cell in C:C has to iterate through every cell in A:A.
The ID field is numeric and can be sorted- is there a way of more intelligently finding the MaxByID?
You may be able to use a pivot table to find the maximum for each unique ID.
Once you have that table, VLOOKUP should enable you to quickly find MaxByID for each ID.
Once you have sorted by ID, you could add columns holding the start row number and the row count for each unique ID. These two numbers give the position and size of the range of values belonging to that ID, so you can then use MAX(OFFSET(StartValueCell,StartThisUnique-1,0,CountThisUnique,1)) to get the max.
This might be faster
{=IF(A2=A1,C1,MAX(($A$2:$A$24=A2)*($B$2:$B$24)))}
Since your data appears to be sorted, you could see if the ID matches the row above and simply copy the max down.
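The idea shared by the sorted-data answers is to scan each run of equal IDs once instead of scanning the whole column for every row. The sketch below illustrates that logic in Python rather than Excel, purely to show the reasoning; the function name max_by_id is hypothetical and the rows are a subset of the question's table.

```python
from itertools import groupby

# (ID, Value) pairs, already sorted by ID, taken from the question's sample.
rows = [(0, 32), (0, 80), (0, 4), (1, 32), (1, 54), (1, 68), (2, 54), (2, 92)]


def max_by_id(sorted_rows):
    """Return the per-ID maximum repeated for every row, in one pass."""
    result = []
    # groupby yields one consecutive run per ID because the input is sorted.
    for _, group in groupby(sorted_rows, key=lambda row: row[0]):
        values = [value for _, value in group]
        group_max = max(values)
        # Copy the run's maximum down to each of its rows.
        result.extend([group_max] * len(values))
    return result


max_by_id_column = max_by_id(rows)
```

Each value is visited once, so the work grows with the number of rows rather than with its square.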
