Get the names of Top 'n' columns based on a threshold for values across a row - python-3.x

Let's say, that I have the following data:
In [1]: df
Out[1]:
Student_Name Maths Physics Chemistry Biology English
0 John Doe 90 87 81 65 70
1 Jane Doe 82 84 75 73 77
2 Mary Lim 40 65 55 60 70
3 Lisa Ray 55 52 77 62 90
I want to add a column to this dataframe which tells me the students' top 'n' subjects that are above a threshold, where the subject names are available in the column names. Let's assume n=3 and threshold=80.
The output would look like the following:
In [3]: df
Out[3]:
Student_Name Maths Physics Chemistry Biology English Top_3_above_80
0 John Doe 90 87 81 65 70 Maths, Physics, Chemistry
1 Jane Doe 82 84 75 73 77 Physics, Maths
2 Mary Lim 40 65 55 60 70 nan
3 Lisa Ray 55 52 77 62 90 English
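For reference, the sample frame used throughout can be rebuilt with a snippet like this (values copied from the table above):
import pandas as pd

df = pd.DataFrame({
    'Student_Name': ['John Doe', 'Jane Doe', 'Mary Lim', 'Lisa Ray'],
    'Maths': [90, 82, 40, 55],
    'Physics': [87, 84, 65, 52],
    'Chemistry': [81, 75, 55, 77],
    'Biology': [65, 73, 60, 62],
    'English': [70, 77, 70, 90]})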
I tried to use the solution written by @jezrael for this question, where they use numpy.argsort to get the positions of the sorted values for the top 'n' columns, but I am unable to set a threshold value below which nothing should be considered.

The idea is to first replace non-matching values with missing values using DataFrame.where, then apply the numpy.argsort solution. Count the Trues per row (the number of non-missing values) and use numpy.where to replace everything beyond that count with empty strings.
Last, the values are joined in a list comprehension, and rows with no match at all are filtered out so they stay missing:
import numpy as np
import pandas as pd

df1 = df.iloc[:, 1:]
m = df1 > 80
count = m.sum(axis=1)
# column names sorted by descending value per row; NaNs (values at or below
# the threshold) sort to the end
arr = df1.columns.values[np.argsort(-df1.where(m).to_numpy(), axis=1)]
# keep at most `count` names per row, capped at n=3
m = np.arange(arr.shape[1]) < np.minimum(count, 3).to_numpy()[:, None]
a = np.where(m, arr, '')
L = [', '.join(x).strip(', ') for x in a]
df['Top_3_above_80'] = pd.Series(L, index=df.index)[count > 0]
print (df)
Student_Name Maths Physics Chemistry Biology English \
0 John Doe 90 87 81 65 70
1 Jane Doe 82 84 75 73 77
2 Mary Lim 40 65 55 60 70
3 Lisa Ray 55 52 77 62 90
Top_3_above_80
0 Maths, Physics, Chemistry
1 Physics, Maths
2 NaN
3 English
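To make the mask step concrete, here is a small walkthrough (a sketch assuming the sample df built above) that prints the intermediates:
import numpy as np

df1 = df.iloc[:, 1:]
m = df1 > 80
count = m.sum(axis=1)
print(count.tolist())  # [3, 2, 0, 1] -> matches above 80 per row
arr = df1.columns.values[np.argsort(-df1.where(m).to_numpy(), axis=1)]
print(arr[0])   # ['Maths' 'Physics' 'Chemistry' 'Biology' 'English']
keep = np.arange(arr.shape[1]) < np.minimum(count, 3).to_numpy()[:, None]
print(keep[1])  # [ True  True False False False] -> 'Physics, Maths'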
If performance is not important, use Series.nlargest per row, but it is really slow for a large DataFrame:
df1 = df.iloc[:, 1:]
m = df1 > 80
count = m.sum(axis=1)
df['Top_3_above_80'] = (df1.where(m)
                          .apply(lambda x: ', '.join(x.nlargest(3).index), axis=1)[count > 0])
print (df)
Student_Name Maths Physics Chemistry Biology English \
0 John Doe 90 87 81 65 70
1 Jane Doe 82 84 75 73 77
2 Mary Lim 40 65 55 60 70
3 Lisa Ray 55 52 77 62 90
Top_3_above_80
0 Maths, Physics, Chemistry
1 Physics, Maths
2 NaN
3 English
Performance:
#4k rows
df = pd.concat([df] * 1000, ignore_index=True)
#print (df)
def f1(df):
    df1 = df.iloc[:, 1:]
    m = df1 > 80
    count = m.sum(axis=1)
    arr = df1.columns.values[np.argsort(-df1.where(m).to_numpy(), axis=1)]
    m = np.arange(arr.shape[1]) < np.minimum(count, 3).to_numpy()[:, None]
    a = np.where(m, arr, '')
    L = [', '.join(x).strip(', ') for x in a]
    df['Top_3_above_80'] = pd.Series(L, index=df.index)[count > 0]
    return df

def f2(df):
    df1 = df.iloc[:, 1:]
    m = df1 > 80
    count = m.sum(axis=1)
    df['Top_3_above_80'] = (df1.where(m)
                              .apply(lambda x: ', '.join(x.nlargest(3).index), axis=1)[count > 0])
    return df
In [210]: %timeit (f1(df.copy()))
19.3 ms ± 272 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [211]: %timeit (f2(df.copy()))
2.43 s ± 61.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

An alternative:
import numpy as np

res = []
tmp = df.set_index('Student_Name').T
for col in list(tmp):
    # top 3 values for this student, kept only if above the threshold
    top = tmp[col].nlargest(3)
    res.append(top[top > 80].index.tolist())
res = [x if len(x) > 0 else np.nan for x in res]
df['Top_3_above_80'] = res
Output:
Student_Name Maths Physics Chemistry Biology English Top_3_above_80
0 John Doe 90 87 81 65 70 [Maths, Physics, Chemistry]
1 Jane Doe 82 84 75 73 77 [Physics, Maths]
2 Mary Lim 40 65 55 60 70 NaN
3 Lisa Ray 55 52 77 62 90 [English]
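Since this stores Python lists in the column rather than strings, a small follow-up (a hypothetical post-processing step, not part of the answer above) converts them to the comma-separated format shown in the question:
# join the lists; rows holding NaN are left untouched
df['Top_3_above_80'] = df['Top_3_above_80'].apply(
    lambda x: ', '.join(x) if isinstance(x, list) else x)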

Related

Diagonals in North East Direction - Time Limit Exceeded in Python 3.8.3

The program must accept an integer matrix of size RxC as input. The program must print the integers on the diagonals in the North-East direction of the matrix, each diagonal on a separate line, as output.
Boundary:
2 <= R,C <= 100
Time Limit: 500ms
Example 1:
Input:
3 3
73 77 76
71 17 87
37 73 98
Output:
73
71 77
37 17 76
73 87
98
Example 2:
Input:
4 6
97 78 7 39 92 45
68 100 49 95 97 100
59 41 81 22 26 100
46 37 81 12 93 10
Output:
97
68 78
59 100 7
46 41 49 39
37 81 95 92
81 22 97 45
12 26 100
93 100
10
My Code:
row, col = map(int, input().split())
matrix = [list(map(int, input().split())) for i in range(row)]
# Redundancy of row and col
rep = []
for i in range(row):
    for j in range(col):
        b = []
        for k in range(i, row):
            if (j, k) not in rep:
                b.append(matrix[k][j])
                rep.append((j, k))
            j -= 1
            if j < 0:
                break
        if len(b):
            print(*(b[::-1]))
My code works well, but when the matrix is of size (100, 100) it exceeds the given time limit. Is there a way to speed it up? Thanks in advance.
Note : No External Libraries should be used!
The trick here is to realize that each number appears in the solution exactly once, so we really only need to visit each value once.
We can also see that each matrix will produce row + col - 1 North-East diagonals, which will help us.
# Original input-reading code
row, col = map(int, input().split())
# I won't turn them into ints, strings actually make it easier for my work
matrix = [input().split() for i in range(row)]
diagonals = [""] * (row + col - 1)
for i in range(row):
    for j in range(col):
        # determine which diagonal the number belongs to, and prepend it
        diagonals[i + j] = "%s %s" % (matrix[i][j], diagonals[i + j])
# print out diagonals one at a time, stripping the trailing space
for diagonal in diagonals:
    print(diagonal.strip())
I never got the chance to run it, but this should give the general idea!
(new to SO, plz be nice :D)
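A variant of the same idea, offered as an untested sketch: collect each diagonal in a list and join once at the end, which avoids repeated string concatenation and the trailing-space issue entirely:
row, col = map(int, input().split())
matrix = [input().split() for _ in range(row)]
# one bucket per North-East diagonal; cell (i, j) lies on diagonal i + j
diagonals = [[] for _ in range(row + col - 1)]
for i in range(row):
    for j in range(col):
        diagonals[i + j].append(matrix[i][j])
# row-major order fills each diagonal top-right cell first,
# so print it reversed (bottom-left to top-right)
for diagonal in diagonals:
    print(' '.join(reversed(diagonal)))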

Python: SettingWithCopyWarning when trying to set value to True based on condition

Data:
Date Stock Peak Trough Price
2002-01-01 33.78 False False 25
2002-01-02 34.19 False False 35
2002-01-03 35.44 False False 33
2002-01-04 36.75 False False 38
I use this line of code to set 'Peak' to true in each row whenever the price of a stock is higher or equal to the max value in the row starting from column 4:
df['Peak'] = np.where(df.iloc[:,4:].max(axis=1) >= df[stock], 'False', 'True')
However, I'm trying to make it so that the first X and last Y rows are not affected. Let's say X and Y are both 10 in this example. I modified it like this:
df.iloc[10:-10]['Peak'] = np.where(df.iloc[10:-10,4:].max(axis=1) >= df.iloc[10:-10][stock], 'False', 'True')
This gives me an error SettingWithCopyWarning and also doesn't work anymore. Does anyone have an idea how to get the desired result so that the first X and last Y rows are always False?
I believe you need get_loc to specify the column index when assigning with df.iloc[], so the whole assignment is a single operation instead of a chained one (using the 'Peak' column and both row bounds from the question):
df.iloc[10:-10, df.columns.get_loc('Peak')] = (
    np.where(df.iloc[10:-10, 4:].max(axis=1)
             >= df.iloc[10:-10, df.columns.get_loc(stock)], 'False', 'True'))
To try it out, here is a test case:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,100,(5,4)),columns=list('ABCD'))
print(df)
A B C D
0 66 92 98 17
1 83 57 86 97
2 96 47 73 32
3 46 96 25 83
4 78 36 96 80
Trying to set column D to np.nan from index 2 onward with a chained assignment, we get the same warning:
df.iloc[2:]['D'] = np.nan
A value is trying to be set on a copy of a slice from a DataFrame. Try
using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
"""Entry point for launching an IPython kernel.
Trying the same thing while avoiding chained assignment by using get_loc succeeds:
df.iloc[2:,df.columns.get_loc('D')] = np.nan
print(df)
A B C D
0 66 92 98 17.0
1 83 57 86 97.0
2 96 47 73 NaN
3 46 96 25 NaN
4 78 36 96 NaN
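As the warning itself suggests, a single .loc call also works. A hedged sketch of the same fix for the question's frame (again assuming `stock` holds the stock column's name and the index is unique):
import numpy as np

rows = df.index[10:-10]  # labels of the rows between the first and last ten
df.loc[rows, 'Peak'] = np.where(
    df.loc[rows, df.columns[4:]].max(axis=1) >= df.loc[rows, stock],
    'False', 'True')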

Remove index from dataframe using Python

I am trying to create a Pandas Dataframe from a string using the following code -
import pandas as pd
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
df = pd.DataFrame([x.split(';') for x in data.split('\n')])
print(df)
I am getting the following result -
0 1 2
0 A B C
1 0 34 88
2 2 45 200
3 3 47 65
4 4 32 140
5 None None
But I need something like the following -
A B C
0 34 88
2 45 200
3 47 65
4 32 140
I added "index = False" while creating the dataframe like -
df = pd.DataFrame([x.split(';') for x in data.split('\n')],index = False)
But, it gives me an error -
TypeError: Index(...) must be called with a collection of some kind, False
was passed
How is this achievable?
Use read_csv with StringIO and the index_col parameter to set the first column as the index:
input_string="""A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
from io import StringIO  # pd.compat.StringIO was removed in newer pandas

df = pd.read_csv(StringIO(input_string), sep=';', index_col=0)
print (df)
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
Your solution should be changed to split on the default separator (arbitrary whitespace), pass all the lists except the first to DataFrame with the first list as the columns parameter, and, if you need the first column as the index, add DataFrame.set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index('A')
print (df)
B C
A
0 34 88
2 45 200
3 47 65
4 32 140
For general solution use first value of first list in set_index:
L = [x.split(';') for x in input_string.split()]
df = pd.DataFrame(L[1:], columns=L[0]).set_index(L[0][0])
EDIT:
You can set the A value as the column axis name instead of the index name:
df = df.rename_axis(df.index.name, axis=1).rename_axis(None)
print (df)
A B C
0 34 88
2 45 200
3 47 65
4 32 140
import pandas as pd

input_string = """A;B;C
0;34;88
2;45;200
3;47;65
4;32;140
"""
data = input_string
# split() without arguments also drops the trailing empty line
df = pd.DataFrame([x.split(';') for x in data.split()])
df.columns = df.iloc[0]  # promote the first row to the header
df = df.iloc[1:].rename_axis(None, axis=1)
df.set_index('A', inplace=True)
df
Output:
B C
A
0 34 88
2 45 200
3 47 65
4 32 140

Subset a data frame by Row and Column totals

I am new to Python and trying to subset a data frame of user-movie ratings first by Row Totals and next by Column Totals. The filter by column totals is taking hours to complete and so I was wondering if you could provide me some pointers to optimize the code.
data_cols = ['user_id','movie_id','rating']
data = pd.read_csv('netflix_data/TrainingRatings.txt', sep=',', names=data_cols)
utrain = (data.sort_values('user_id'))
print(utrain.tail())
Movie_Ratings = utrain.pivot_table(index = ['user_id'],columns = ['movie_id'], values = ['rating'], aggfunc = lambda x:x)
Movie_Ratings.head()
Movie_Ratings = Movie_Ratings.fillna(0)
#Filter by column totals
Movie_Ratings.loc[len(Movie_Ratings)] = [Movie_Ratings[col].sum() for col in Movie_Ratings.columns]
##Following portion is taking the maximum amount of time
x = Movie_Ratings.loc[len(Movie_Ratings)-1]
for col in Movie_Ratings.columns:
    if x[col] <= 500:
        Movie_Ratings.drop(col, axis=1, inplace=True)
First, you can compute the totals row with DataFrame.sum alone:
Movie_Ratings.loc[len(Movie_Ratings)] = Movie_Ratings.sum()
And then filter without loops:
import numpy as np
import pandas as pd

np.random.seed(100)
Movie_Ratings = pd.DataFrame(np.random.randint(250, size=(5,5)), columns=list('ABCDE'))
print (Movie_Ratings)
A B C D E
0 8 24 67 103 87
1 79 176 138 94 180
2 98 53 66 226 14
3 34 241 240 24 143
4 228 107 60 58 144
Movie_Ratings.loc[len(Movie_Ratings)] = Movie_Ratings.sum()
Movie_Ratings = Movie_Ratings.loc[:, ~(Movie_Ratings.iloc[-1] <= 500)]
#Or change the condition to > and drop the ~ to invert it:
#Movie_Ratings = Movie_Ratings.loc[:, Movie_Ratings.iloc[-1] > 500]
print (Movie_Ratings)
B C D E
0 24 67 103 87
1 176 138 94 180
2 53 66 226 14
3 241 240 24 143
4 107 60 58 144
5 601 571 505 568
Explanation:
print (Movie_Ratings.iloc[-1])
A 447
B 601
C 571
D 505
E 568
Name: 5, dtype: int64
print (Movie_Ratings.iloc[-1]<= 500)
A True
B False
C False
D False
E False
Name: 5, dtype: bool
print (~(Movie_Ratings.iloc[-1]<= 500))
A False
B True
C True
D True
E True
Name: 5, dtype: bool
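If the totals row is only needed for the filter, a hedged shortcut is to filter on the column sums directly and never append the row (assuming this runs before any totals row is added, so the sums are not inflated):
# keep only the columns whose total rating exceeds 500
Movie_Ratings = Movie_Ratings.loc[:, Movie_Ratings.sum() > 500]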

Pandas: how to test that top-n-dataframe really results from original dataframe

I have a DataFrame, foo:
A B C D E
0 50 46 18 65 55
1 48 56 98 71 96
2 99 48 36 79 70
3 15 24 25 67 34
4 77 67 98 22 78
and another DataFrame, bar, which contains the greatest 2 values of each row of foo. All other values have been replaced with zeros, to create sparsity:
A B C D E
0 0 0 0 65 55
1 0 0 98 0 96
2 99 0 0 79 0
3 0 0 0 67 34
4 0 0 98 0 78
How can I test that every row in bar really contains the desired values?
One more thing: the solution should work with large DataFrames, i.e. 20000 x 20000.
Obviously you can do that with looping and efficient sorting, but maybe a better way would be:
import numpy as np

n = foo.shape[0]
#Test1:
#bar has the original data except zeros in place of all but two values,
#so diff must contain exactly two zeros per row:
diff = foo - bar
test1 = ((diff == 0).sum(axis=1) == 2).sum() == n
#Test2:
#bar has 3 zeros (columns - 2) on each line:
test2 = ((bar == 0).sum(axis=1) == 3).sum() == n
#Test3:
#the 2 numbers that bar kept are the max of each row
bar2 = bar.replace(0, np.nan)
#the max of the remaining values must not exceed the min of the kept values
#(<= rather than < so that ties between kept and dropped values still pass):
row_ok = diff.max(axis=1) <= bar2.min(axis=1)
test3 = row_ok.sum() == n
I think this covers all cases, but haven't tested it all...
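As a quick sanity check, running the three tests on the sample data from the question (frames rebuilt from the tables above) should print True three times:
import numpy as np
import pandas as pd

foo = pd.DataFrame([[50, 46, 18, 65, 55],
                    [48, 56, 98, 71, 96],
                    [99, 48, 36, 79, 70],
                    [15, 24, 25, 67, 34],
                    [77, 67, 98, 22, 78]], columns=list('ABCDE'))
bar = pd.DataFrame([[0, 0, 0, 65, 55],
                    [0, 0, 98, 0, 96],
                    [99, 0, 0, 79, 0],
                    [0, 0, 0, 67, 34],
                    [0, 0, 98, 0, 78]], columns=list('ABCDE'))
n = foo.shape[0]
diff = foo - bar
bar2 = bar.replace(0, np.nan)
print(((diff == 0).sum(axis=1) == 2).sum() == n)          # True
print(((bar == 0).sum(axis=1) == 3).sum() == n)           # True
print((diff.max(axis=1) <= bar2.min(axis=1)).sum() == n)  # True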
