Calculate precision and recall based on values in two columns of a python pandas dataframe? - python-3.x

I have a dataframe in the following format:
Column 1 (Expected Output) | Column 2 (Actual Output)
[2,10,5,266,8] | [7,2,9,266]
[4,89,34,453] | [4,22,34,453]
I would like to find the number of items in the actual output that were also expected. For example, for row 1, only 2 and 266 appear in both the expected and actual output, which gives precision = 2/4 and recall = 2/5.
Since I have over 500 rows, I would like to find some sort of formula to find the precision and recall for each row.

Setting up your df like this:
import pandas as pd

df = pd.DataFrame({"Col1": [[2,10,5,266,8],[4,89,34,453]],
"Col2":[[7,2,9,266],[4,22,34,453]]})
You can find the matching values with:
df["matches"] = [set(df.loc[r, "Col1"]) & set(df.loc[r, "Col2"]) for r in range(len(df))]
from which you can calculate precision and recall.
But be warned that your example takes no account of the ordering of the elements in the expected and actual output lists; this solution will fall down if ordering matters, and also if there are duplicate values in the "Expected Output" list.
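For completeness, a minimal sketch of how those match sets could be turned into per-row scores, assuming precision is measured against the actual list and recall against the expected list:
import pandas as pd

df = pd.DataFrame({"Col1": [[2,10,5,266,8],[4,89,34,453]],
                   "Col2": [[7,2,9,266],[4,22,34,453]]})

# per-row intersection of expected (Col1) and actual (Col2) values
df["matches"] = [set(a) & set(b) for a, b in zip(df["Col1"], df["Col2"])]

# precision = matched / number of actual items, recall = matched / number of expected items
df["precision"] = df["matches"].apply(len) / df["Col2"].apply(len)
df["recall"] = df["matches"].apply(len) / df["Col1"].apply(len)

print(df[["precision", "recall"]])
# row 0: precision 0.5 (2 of 4), recall 0.4 (2 of 5)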

Related

Filter Dataframe by comparing one column to list of other columns

I have a dataframe with numerous float columns. I want to filter the dataframe, leaving only the values that are in between the High and Low columns of the same dataframe.
I know how to do this when the conditions are one column compared to another column. But there are 102 columns, so I cannot write a condition for each column. And all my research just illustrates how to compare two columns and not one column against all others (or I am not typing the right search terms).
I tried df= df[ (df['High'] <= df[DFColRBs]) & (df['Low'] >= df[DFColRBs])].copy() But it erases everything.
and I tried booleanselction = df[ (df[DFColRBs].between(df['High'],df['Low'])]
and I tried: df= df[(df[DFColRBs].ge(df['Low'])) & (df[DFColRBs].le(df['Low']))].copy()
and I tried:
BoolMatrix = (df[DFColRBs].ge(DF_copy['Low'], axis=0)) & (df[DFColRBs].le(DF_copy['Low'], axis=0))
df= df[BoolMatrix].copy()
But it erases everything in the dataframe, even 3 columns that are not included in the list.
I appreciate the guidance.
Example Dataframe:
High Low Close _1m_21 _1m_34 _1m_55 _1m_89 _1m_144 _1m_233 _5m_21 _5m_34 _5m_55
0 1.23491 1.23456 1.23456 1.23401 1.23397 1.23391 1.2339 1.2337 1.2335 1.23392 1.23363 1.23343
1 1.23492 1.23472 1.23472 1.23422 1.23409 1.234 1.23392 1.23375 1.23353 1.23396 1.23366 1.23347
2 1.23495 1.23479 1.23488 1.23454 1.23422 1.23428 1.23416 1.23404 1.23372 1.23415 1.234 1.23367
3 1.23494 1.23472 1.23473 1.23457 1.23425 1.23428 1.23417 1.23405 1.23373 1.23415 1.234 1.23367
Based on what you've said in the comments, it's best to split the df into the pieces you want to operate on and the ones you don't, then use matrix operations.
tmp_df = DF_copy.iloc[:, 3:].copy()
# or tmp_df = DF_copy[DFColRBs].copy()
# mask by comparing test columns with the high and low columns
m = tmp_df.le(DF_copy['High'], axis=0) & tmp_df.ge(DF_copy['Low'], axis=0)
# combine the masked df with the original cols
DF_copy2 = pd.concat([DF_copy.iloc[:, :3], tmp_df.where(m)], axis=1)
# or replace DF_copy.iloc[:, :3] with DF_copy.drop(columns=DFColRBs)
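For reference, a self-contained sketch of the same masking idea on a made-up frame (the column names and values are invented; test_cols stands in for DFColRBs):
import pandas as pd

df = pd.DataFrame({"High":  [1.2349, 1.2350],
                   "Low":   [1.2345, 1.2347],
                   "Close": [1.2346, 1.2348],
                   "colA":  [1.2346, 1.2344],
                   "colB":  [1.2351, 1.2348]})
test_cols = ["colA", "colB"]   # stands in for DFColRBs

tmp = df[test_cols]
# True where the value lies between Low and High of the same row
mask = tmp.le(df["High"], axis=0) & tmp.ge(df["Low"], axis=0)

# keep the non-test columns untouched, NaN out values that fall outside the band
result = pd.concat([df.drop(columns=test_cols), tmp.where(mask)], axis=1)
print(result)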

Summary statistics for each group and transpose using pandas

I have a dataframe as shown below
import numpy as np
import pandas as pd

df = pd.DataFrame({'person_id': [11,11,11,11,11,11,11,11,12,12,12],
'time' :[0,0,0,1,2,3,4,4,0,0,1],
'value':[101,102,np.nan,120,143,153,160,170,96,97,99]})
What I would like to do is
a) Get the summary statistics for each subject for each time point (ex: 0hr, 1hr, 2hr etc)
b) Please note that NA rows shouldn't be counted as separate records when computing the mean
I was trying the below
for i in df['subject_id'].unique()
df[df['subject_id'].isin([i])].time.unique
val_mean = df.groupby(['subject_id','time']][value].mean()
val_stddev = df[value].std()
But I couldn't get the expected output
I expect my output to be as shown below, where I expect one row for each time point (ex: 0hr, 1hr, 2hr, 3hr etc). Please note that NA rows shouldn't be counted as separate records when computing the mean.
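No answer is shown for this question; as a rough sketch (not the original solution), a groupby aggregation gives per-person, per-time statistics, and pandas skips NaN values by default. Note that the sample data uses person_id, while the attempt above refers to subject_id:
import numpy as np
import pandas as pd

df = pd.DataFrame({'person_id': [11,11,11,11,11,11,11,11,12,12,12],
                   'time': [0,0,0,1,2,3,4,4,0,0,1],
                   'value': [101,102,np.nan,120,143,153,160,170,96,97,99]})

# mean and standard deviation per person and time point; NaN is excluded automatically
stats = df.groupby(['person_id', 'time'])['value'].agg(['mean', 'std'])

# spread the time points across columns, roughly the "transpose" the title asks for
wide = stats.unstack('time')
print(wide)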

Scaling the data still gives NaN in python

The problem is that after removing all the columns in my data that have the same value in every row, and then applying the min-max scaling formula (x - x.min()) / (x.max() - x.min()), I still get a column with NaNs.
P.S. I removed these constant columns because if I keep them and then do the scaling, x.max() - x.min() will be 0 and all of these columns will end up with NaN values after the scaling.
So, what I am doing is the following:
I have train and test data sets separately. Once I import them in jupyter notebook, I create a function to show me which columns have exactly the same values on the rows.
def uniques(df):
    for e in df.columns:
        if len(pd.unique(df[e])) == 1:
            yield e
Then I check which are the constant columns:
col_test_=uniques(test_st1)
col_test=list(col_test_)
col_test
Result:
['uswrf_s1_3','uswrf_s1_6','uswrf_s1_10','uswrf_s1_11','uswrf_s1_13','uswrf_s1_5']
Then I get all indices of these columns:
for i in list(col_test):
    idx_col = test_st1.columns.get_loc(i)
    print("All values in column {} of the test data are the same".format(idx_col))
Result:
All values in column 220 of the test data are the same
All values in column 445 of the test data are the same
All values in column 745 of the test data are the same
All values in column 820 of the test data are the same
All values in column 970 of the test data are the same
All values in column 1120 of the test data are the same
Then I drop these columns because I will not need them after applying the min-max scaling.
for j in col_test:
    test_st1 = test_st1.drop(columns=j)
Basically I do the same for the train partition.
Next I apply the min-max scaling formula to both the train and test partitions, using the statistics of the train data:
train_1= (train_st1-train_st1.min())/(train_st1.max()-train_st1.min())
test_1 = (test_st1-train_st1.min())/(train_st1.max()-train_st1.min())
After getting rid of the columns with the same values, I assumed there wouldn't be any columns with NaNs after the normalization. However, when I check whether there are any columns with NaN values, the following happens:
a=uniques(test_1)
b=list(a)
b
Result:
['uswrf_s1_3']
Checking which column is that:
test_1.columns.get_loc('uswrf_s1_3')
Result:
1126
How come I got a column with NaNs after the scaling bearing in mind that I got rid of all the columns whose values on the rows are completely the same?
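No answer is shown for this question, but one plausible cause (assuming the constant columns differ between train and test) is pandas label alignment: arithmetic between a DataFrame and a Series aligns on column names, so a column that was dropped from test_st1 but is still present in the train statistics reappears as all NaN. A minimal illustration with made-up data:
import pandas as pd

train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "uswrf_s1_3": [5.0, 6.0, 7.0]})
test = pd.DataFrame({"a": [1.5, 2.5, 3.5]})   # 'uswrf_s1_3' was dropped from test only

# the subtraction and division align on column labels, so 'uswrf_s1_3'
# exists only in the train statistics and comes back as a NaN column
scaled_test = (test - train.min()) / (train.max() - train.min())
print(scaled_test)
#       a  uswrf_s1_3
# 0  0.25         NaN
# 1  0.75         NaN
# 2  1.25         NaN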

Pandas sort not maintaining sort

What is the right way to multiply two sorted pandas Series?
When I run the following
import pandas as pd
x = pd.Series([1,3,2])
x = x.sort_values()
print(x)
w = [1]*3
print(w*x)
I get what I would expect - [1,2,3]
However, when I change it to a Series:
w = pd.Series(w)
print(w*x)
It appears to multiply based on the index of the two series, so it returns [1,3,2]
Your results are essentially the same, just sorted differently.
>>> w*x
0 1
2 2
1 3
>>> pd.Series(w)*x
0 1
1 3
2 2
>>> (w*x).sort_index()
0 1
1 3
2 2
The rule is basically this: Anytime you multiply a dataframe or series by a dataframe or series, it will be done by index. That's what makes it pandas and not numpy. As a result, any pre-sorting is necessarily ignored.
But if you multiply a dataframe or series by a list or numpy array of a conforming shape/size, then the list or array will be treated as having exactly the same index as the dataframe or series. The pre-sorting of the series or dataframe can be preserved in this case because there cannot be any conflict with the list or array, which have no index at all.
Both of these types of behavior can be very desirable depending on what you are trying to do. That's why you will often see answers here that do something like df1 * df2.values when the second type of behavior is desired.
In this example, it doesn't really matter because your list is [1,1,1] and gives the same answer either way, but if it was [1,2,3] you would get different answers, not just differently sorted answers.
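A small sketch of the two behaviors described above, using an unsorted index and a non-trivial multiplier:
import pandas as pd

x = pd.Series([1, 3, 2]).sort_values()   # values [1, 2, 3], index [0, 2, 1]
w = pd.Series([10, 20, 30])              # index [0, 1, 2]

# Series * Series: aligned by index label, so label 1 pairs 3 with 20
print(x * w)          # 0 -> 10, 1 -> 60, 2 -> 60

# Series * array: purely positional, so x's sorted order is preserved
print(x * w.values)   # 0 -> 10, 2 -> 40, 1 -> 90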

Replace Number that falls Between Two Values (Pandas,Python3)

Simple Question Here:
b = 8143.1795845088482
d = 14723.523658084257
My Df called final:
Words score
This 90374.98788
is 80559.4495
a 43269.67002
sample 34535.01172
output Very Low
I want to replace all the scores with either 'very low', 'low', 'medium', or 'high' based on whether they fall between quartile ranges.
something like this works:
final['score'][final['score'] <= b] = 'Very Low' #This is shown in the example above
but when I try to play this immediately after it doesn't work:
final['score'][final['score'] >= b] and final['score'][final['score'] <= d] = 'Low'
This gives me the error: cannot assign operator. Anyone know what I am missing?
Firstly, you must use the bitwise operators (&, | instead of and, or), because you are comparing arrays and therefore all the values rather than a single value (comparing arrays with the plain boolean operators is ambiguous, and you cannot override the global and operator to behave the way you want). Secondly, you must use parentheses around multiple conditions because of operator precedence.
Finally, you are performing chained indexing, which may or may not work and will raise a warning. To set your column value, use loc like this:
In [4]:
b = 25
d = 50
final.loc[(final['score'] >= b) & (final['score'] <= d), 'score'] = 'Low'
final
Out[4]:
Words score
0 This 10
1 is Low
2 for Low
3 You 704
If your DataFrame's scores were all floats,
In [234]: df
Out[234]:
Words score
0 This 90374.98788
1 is 80559.44950
2 a 43269.67002
3 sample 34535.01172
then you could use pd.qcut to categorize each value by its quartile:
In [236]: df['quartile'] = pd.qcut(df['score'], q=4, labels=['very low', 'low', 'medium', 'high'])
In [237]: df
Out[237]:
Words score quartile
0 This 90374.98788 high
1 is 80559.44950 medium
2 a 43269.67002 low
3 sample 34535.01172 very low
DataFrame columns have a dtype. When the values are all floats, the column has a float dtype, which can be very fast for numerical calculations. When the values are a mixture of floats and strings, the dtype is object, which means each value is a Python object. While this gives the values a lot of flexibility, it is also very slow, since every operation ultimately falls back to calling a Python function instead of a NumPy/pandas C/Fortran/Cython routine. Thus you should try to avoid mixing floats and strings in a single column.
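A quick way to see that dtype difference (values taken from the example above):
import pandas as pd

floats = pd.Series([90374.98788, 80559.44950, 43269.67002, 34535.01172])
print(floats.dtype)   # float64, fast vectorized operations

mixed = pd.Series([90374.98788, "Very Low", 43269.67002, "Very Low"])
print(mixed.dtype)    # object, every operation falls back to Python-level calls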
