Excel: I want to SUM the top 2 values IF a cell in the row matches a string

I've been trying to figure out how to SUM the top 2 values of an array using SUMPRODUCT, but I also want to add a criterion so that values are only summed when the row matches a specific string. I thought I could combine SUMPRODUCT and SUMIF, but I have been unsuccessful.
Position Age ADP Trend Value
QB 23 241 84.2 21
QB 35 185 -37.5 142
QB 27 300 25 19
QB 26 300 25 19
QB 32 300 25 19
RB 22 98 -2.2 1051
RB 24 69 0.3 1929
RB 24 238 6 25
RB 26 300 25 19
RB 26 300 25 19
WR 22 300 25 19
WR 24 300 25 19
WR 26 232 -17 36
WR 25 300 25 19
WR 28 300 25 19
WR 23 9 -4.2 8591
WR 23 178 21.4 161
WR 23 38 8.5 4679
WR 26 222 102.8 53
WR 23 300 25 19
WR 26 300 25 19
TE 26 117 -18.7 617
TE 36 193 -30.3 119
TE 26 199 -22.5 105
TE 24 300 25 19
What I want is to SUM the top two values under the Value column IF the Position = QB.
How can I accomplish this?
Cheers!

Use this array formula:
=SUM(LARGE(IF(A2:A25="QB",E2:E25,""),1),LARGE(IF(A2:A25="QB",E2:E25,""),2))
Press CTRL+SHIFT+ENTER to evaluate the formula as it is an array formula.
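Alternatively, if you'd rather stay with SUMPRODUCT and avoid the CTRL+SHIFT+ENTER entry, this variant should work too (assuming the Value figures are non-negative, since non-QB rows are turned into zeros here):
=SUMPRODUCT(LARGE((A2:A25="QB")*(E2:E25),{1,2}))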

How to update column based on conditions and previous row is not equal to the same condition

How do I identify a Winner week when the previous row is not equal to the current row?
A week is classified as a "Winner" when [Weekly_Counts] is greater than [Winner_Num] and the previous week is not a Winner.
Here is a copy of the final data set:
Year ISOweeknum Weekly_Counts NumOfWeeks Yearly_Count WeeklyAverage Winner_Num
0 2017 9 1561 44 12100 275 330
1 2017 10 1001 44 12100 275 330
2 2017 11 451 44 12100 275 330
3 2017 12 513 44 12100 275 330
4 2017 13 431 44 12100 275 330
... ... ... ... ... ... ... ...
232 2021 32 136 36 4212 117 140
233 2021 33 84 36 4212 117 140
234 2021 34 95 36 4212 117 140
235 2021 35 120 36 4212 117 140
236 2021 53 77 36 4212 117 140
I've tried using this code, but it's not giving the desired results:
new_df3['Winner_Results'] = 0
for i in range(len(new_df3)-1):
    if (new_df3['Weekly_Votes_Counts'].iloc[i] > new_df3['Winner_Num'].iloc[i]) & (new_df3['Weekly_Votes_Counts'].iloc[i+1] > new_df3['Winner_Num'].iloc[i+1]):
        new_df3['Winner_Results'].iloc[i] = 'Not Winner'
    else:
        new_df3['Winner_Results'].iloc[i] = 'Winner'
.py:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
new_df3['Winner_Results'].iloc[i] = 'Winner'
The Expected Result (screenshot): https://i.stack.imgur.com/BEcuX.png
Here's a way to get the result in your question:
import numpy as np
# flag weeks whose count exceeds the threshold
df['Counts_Gt_Num'] = df.Weekly_Counts > df.Winner_Num
df['Cumsum'] = df['Counts_Gt_Num'].cumsum()
# at the first row after each streak, record the running total, then forward-fill it
df.loc[(~df['Counts_Gt_Num']) & df['Counts_Gt_Num'].shift(), 'Subtract'] = df['Cumsum']
# within a streak, Cumsum - Subtract counts 1, 2, 3, ... so every other row is a winner
df['Is_Winner'] = (df['Cumsum'] - df['Subtract'].ffill().fillna(0)).mod(2)
df['Winner_Results'] = np.where(df['Is_Winner'], 'Winner', 'Not Winner')
df = df.drop(columns=['Counts_Gt_Num', 'Cumsum', 'Subtract', 'Is_Winner'])
Output:
Year ISOweeknum Weekly_Counts NumOfWeeks Yearly_Count WeeklyAverage Winner_Num Winner_Results
0 2017 9 1561 44 12100 275 330 Winner
1 2017 10 1001 44 12100 275 330 Not Winner
2 2017 11 451 44 12100 275 330 Winner
3 2017 12 513 44 12100 275 330 Not Winner
4 2017 13 431 44 12100 275 330 Winner
5 2017 14 371 44 12100 275 330 Not Winner
6 2017 15 361 44 12100 275 330 Winner
7 2017 16 336 44 12100 275 330 Not Winner
8 2017 17 332 44 12100 275 330 Winner
9 2017 18 124 44 12100 275 330 Not Winner
10 2017 19 142 44 12100 275 330 Not Winner
11 2017 20 290 44 12100 275 330 Not Winner
12 2017 21 116 44 12100 275 330 Not Winner
13 2017 22 142 44 12100 275 330 Not Winner
14 2017 23 132 44 12100 275 330 Not Winner
15 2017 24 69 44 12100 275 330 Not Winner
16 2017 25 124 44 12100 275 330 Not Winner
17 2017 26 136 44 12100 275 330 Not Winner
18 2017 27 63 44 12100 275 330 Not Winner
Explanation:
mark rows with counts > num using a boolean flag in new column Counts_Gt_Num
put the cumulative sum of that flag in new column Cumsum
create new column Subtract by copying Cumsum for rows where Counts_Gt_Num is False but was True in the preceding row, and forward-fill it with ffill() for the rows with NaN
create Is_Winner by selecting as winners the rows at an even offset (0, 2, 4, ...) within a streak of non-zero values in Cumsum - Subtract
create Winner_Results by assigning the desired win/no-win label based on Is_Winner
drop the intermediate columns.
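For a quick sanity check, here is a minimal loop-based sketch that states the rule directly; it assumes the frame is named df and has the Weekly_Counts and Winner_Num columns shown above (Winner_Check is just an illustrative column name), and it should produce the same labels as the vectorised version:
def label_winners(df):
    # A week is a Winner when Weekly_Counts > Winner_Num
    # and the previous week was not a Winner.
    labels = []
    prev_winner = False
    for counts, num in zip(df['Weekly_Counts'], df['Winner_Num']):
        is_winner = counts > num and not prev_winner
        labels.append('Winner' if is_winner else 'Not Winner')
        prev_winner = is_winner
    return labels

df['Winner_Check'] = label_winners(df)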

exception handling attempt in pandas

I am having difficulty creating two columns, "Home Score" and "Away Score", from the Wikipedia table I am trying to parse.
I tried the following script with two try-except-else statements to see if that would work.
test_matches = pd.read_html('https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results')
test_matches = test_matches[1]
test_matches['Year'] = test_matches['Date'].str[-4:].apply(pd.to_numeric)
test_matches_worst = test_matches[(test_matches['Winner'] != 'Wales') & (test_matches['Year'] >= 2007) & (test_matches['Competition'].str.contains('Nations'))]
try:
    test_matches_worst['Home Score'] = test_matches_worst['Score'].str.split("–").str[0].apply(pd.to_numeric)
except:
    print("let's try again")
else:
    test_matches_worst['Home Score'] = test_matches_worst['Score'].str.split("-").str[0].apply(pd.to_numeric)
try:
    test_matches_worst['Away Score'] = test_matches_worst['Score'].str.split("–").str[1].apply(pd.to_numeric)
except:
    print("let's try again")
else:
    test_matches_worst['Away Score'] = test_matches_worst['Score'].str.split("-").str[1].apply(pd.to_numeric)
test_matches_worst['Margin'] = (test_matches_worst['Home Score'] - test_matches_worst['Away Score']).abs()
test_matches_worst.sort_values('Margin', ascending=False).reset_index(drop = True)#.head(20)
However, I receive a KeyError, and the "Home Score" column is not displayed in the dataframe when I shorten the code. What is the best way to handle this particular table and generate the columns I want? Any assistance would be greatly appreciated. Thanks in advance.
The problem with the data you collected is the hyphen/dash: except for the last row, every score separator is an en dash (U+2013), not a hyphen-minus (U+002D), so splitting only on "-" leaves most rows unsplit. Split on either character:
sep = r'[-\u2013]'  # matches either a hyphen-minus or an en dash
# df is test_matches_worst
df[['Home Score','Away Score']] = df['Score'].str.split(sep, expand=True).astype(int)
df['Margin'] = df['Home Score'].sub(df['Away Score']).abs()
Output:
>>> df[['Score', 'Home Score', 'Away Score', 'Margin']]
Score Home Score Away Score Margin
565 9–19 9 19 10
566 21–9 21 9 12
567 32–21 32 21 11
568 23–20 23 20 3
593 21–16 21 16 5
595 15–17 15 17 2
602 30–17 30 17 13
604 20–26 20 26 6
605 27–12 27 12 15
614 19–26 19 26 7
618 28–9 28 9 19
644 22–30 22 30 8
656 26–3 26 3 23
658 29–18 29 18 11
666 16–21 16 21 5
679 16–16 16 16 0
682 25–21 25 21 4
693 16–21 16 21 5
694 29–13 29 13 16
696 20–18 20 18 2
704 12–6 12 6 6
705 37–27 37 27 10
732 24–14 24 14 10
733 23–27 23 27 4
734 33–30 33 30 3
736 10–14 10 14 4
737 32–9 32 9 23
739 13–24 13 24 11
745 32–30 32 30 2
753 29-7 29 7 22
Note: you will probably receive a SettingWithCopyWarning
To solve it, use test_matches = test_matches[1].copy()
Bonus
Pandas functions like to_datetime, to_timedelta or to_numeric accept a Series directly, so you can avoid apply:
test_matches['Year'] = pd.to_numeric(test_matches['Date'].str[-4:])
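Putting the pieces together, here is a minimal end-to-end sketch (not the answer's exact code): it reuses the question's filtering, takes a .copy() to avoid the SettingWithCopyWarning, and applies the dash-aware split. It assumes the Wikipedia table layout still matches the one in the question:
import pandas as pd

tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results')
df = tables[1].copy()  # .copy() avoids SettingWithCopyWarning later

df['Year'] = pd.to_numeric(df['Date'].str[-4:])
worst = df[(df['Winner'] != 'Wales')
           & (df['Year'] >= 2007)
           & (df['Competition'].str.contains('Nations'))].copy()

sep = r'[-\u2013]'  # hyphen-minus or en dash
worst[['Home Score', 'Away Score']] = worst['Score'].str.split(sep, expand=True).astype(int)
worst['Margin'] = worst['Home Score'].sub(worst['Away Score']).abs()
print(worst.sort_values('Margin', ascending=False).head(20))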

Create new column in data frame by interpolating other column in between a particular date range - Pandas

I have a df as shown below; the data looks like this:
Date y
0 2020-06-14 127
1 2020-06-15 216
2 2020-06-16 4
3 2020-06-17 90
4 2020-06-18 82
5 2020-06-19 70
6 2020-06-20 59
7 2020-06-21 48
8 2020-06-22 23
9 2020-06-23 25
10 2020-06-24 24
11 2020-06-25 22
12 2020-06-26 19
13 2020-06-27 10
14 2020-06-28 18
15 2020-06-29 157
16 2020-06-30 16
17 2020-07-01 14
18 2020-07-02 343
The code to create the data frame.
# Create a dummy dataframe
import pandas as pd
import numpy as np
y0 = [127,216,4,90, 82,70,59,48,23,25,24,22,19,10,18,157,16,14,343]
def initial_forecast(data):
    data['y'] = y0
    return data
# Initial date dataframe
df_dummy = pd.DataFrame({'Date': pd.date_range('2020-06-14', periods=19, freq='1D')})
# Dates
start_date = df_dummy.Date.iloc[1]
print(start_date)
end_date = df_dummy.Date.iloc[17]
print(end_date)
# Adding y0 in the dataframe
df_dummy = initial_forecast(df_dummy)
df_dummy
From the above I would like to interpolate the data for a particular date range.
I would like to interpolate (linearly) between 2020-06-17 and 2020-06-27.
That is, from 2020-06-17 to 2020-06-27 the 'y' value changes from 90 to 10 in 10 steps, so on average it decreases by 8 per step: (90 - 10) / 10 steps = 8.
The expected output:
Date y y_new
0 2020-06-14 127 127
1 2020-06-15 216 216
2 2020-06-16 4 4
3 2020-06-17 90 90
4 2020-06-18 82 82
5 2020-06-19 70 74
6 2020-06-20 59 66
7 2020-06-21 48 58
8 2020-06-22 23 50
9 2020-06-23 25 42
10 2020-06-24 24 34
11 2020-06-25 22 26
12 2020-06-26 19 18
13 2020-06-27 10 10
14 2020-06-28 18 18
15 2020-06-29 157 157
16 2020-06-30 16 16
17 2020-07-01 14 14
18 2020-07-02 343 343
Note: in the remaining date range the y_new value should be the same as the y value.
I tried the code below, but it is not giving the desired output:
# Function
def df_interpolate(df, start_date, end_date):
    df["Date"] = pd.to_datetime(df["Date"])
    df.loc[(df['Date'] >= start_date) & (df['Date'] <= end_date), 'y_new'] = np.nan
    df['y_new'] = df['y'].interpolate().round()
    return df
df1 = df_interpolate(df_dummy, '2020-06-17', '2020-06-27')
With some tweaks to your function it works: use np.where to create the new column, remove the = from your conditionals (so the boundary values 90 and 10 stay in place as anchors for the interpolation), interpolate y_new rather than y, and cast to int as per your expected output.
def df_interpolate(df, start_date, end_date):
    df["Date"] = pd.to_datetime(df["Date"])
    df['y_new'] = np.where((df['Date'] > start_date) & (df['Date'] < end_date), np.nan, df['y'])
    df['y_new'] = df['y_new'].interpolate().round().astype(int)
    return df
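For completeness, calling the fixed function exactly as in the question (a usage sketch reusing the df_dummy frame built above) reproduces the table below:
df1 = df_interpolate(df_dummy, '2020-06-17', '2020-06-27')
print(df1)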
Date y y_new
0 2020-06-14 127 127
1 2020-06-15 216 216
2 2020-06-16 4 4
3 2020-06-17 90 90
4 2020-06-18 82 82
5 2020-06-19 70 74
6 2020-06-20 59 66
7 2020-06-21 48 58
8 2020-06-22 23 50
9 2020-06-23 25 42
10 2020-06-24 24 34
11 2020-06-25 22 26
12 2020-06-26 19 18
13 2020-06-27 10 10
14 2020-06-28 18 18
15 2020-06-29 157 157
16 2020-06-30 16 16
17 2020-07-01 14 14
18 2020-07-02 343 343

Choosing the values in the column based on the maximum values of other column

I am choosing values in a Pandas DataFrame.
I would like to choose the values in the columns 'One_T', 'Two_T', 'Three_T' (the total counts) based on the ratio columns ('One_R', 'Two_R', 'Three_R').
The comparison is done on the columns ('One_R', 'Two_R', 'Three_R'), and the values to pick come from the columns ('One_T', 'Two_T', 'Three_T').
I would like to find the highest value among the columns ('One_R', 'Two_R', 'Three_R') and put the corresponding value from 'One_T', 'Two_T', 'Three_T' into a new column 'Highest'.
For example, the first row has a higher value in One_R than in Two_R and Three_R, so the value from One_T fills the column named Highest.
The initial data frame is test in the code below, and the desired result is result.
import pandas as pd

test = pd.DataFrame([[150,30,140,20,120,19],[170,31,130,30,180,22],[230,45,100,50,140,40],
                     [140,28,80,10,60,10],[100,25,80,27,50,23]],
                    index=['2019-01-01','2019-02-01','2019-03-01','2019-04-01','2019-05-01'],
                    columns=['One_T','One_R','Two_T','Two_R','Three_T','Three_R'])
One_T One_R Two_T Two_R Three_T Three_R
2019-01-01 150 30 140 20 120 19
2019-02-01 170 31 130 30 180 22
2019-03-01 230 45 100 50 140 40
2019-04-01 140 28 80 10 60 10
2019-05-01 100 25 80 27 50 23
result = pd.DataFrame([[150,30,140,20,120,19,150],[170,31,130,30,180,22,170],[230,45,100,50,140,40,100],
                       [140,28,80,10,60,10,140],[100,25,80,27,50,23,80]],
                      index=['2019-01-01','2019-02-01','2019-03-01','2019-04-01','2019-05-01'],
                      columns=['One_T','One_R','Two_T','Two_R','Three_T','Three_R','Highest'])
One_T One_R Two_T Two_R Three_T Three_R Highest
2019-01-01 150 30 140 20 120 19 150
2019-02-01 170 31 130 30 180 22 170
2019-03-01 230 45 100 50 140 40 100
2019-04-01 140 28 80 10 60 10 140
2019-05-01 100 25 80 27 50 23 80
Is there any way to do this?
Thank you for your time and consideration.
You can solve this using df.filter to select columns with the _R suffix, then idxmax. Then replace _R with _T and use df.lookup:
s = test.filter(like='_R').idxmax(1).str.replace('_R','_T')
test['Highest'] = test.lookup(s.index,s)
print(test)
One_T One_R Two_T Two_R Three_T Three_R Highest
2019-01-01 150 30 140 20 120 19 150
2019-02-01 170 31 130 30 180 22 170
2019-03-01 230 45 100 50 140 40 100
2019-04-01 140 28 80 10 60 10 140
2019-05-01 100 25 80 27 50 23 80
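Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on current pandas the second line fails. A rough equivalent using integer position indexing (a sketch, not part of the original answer):
import numpy as np

s = test.filter(like='_R').idxmax(axis=1).str.replace('_R', '_T')
col_idx = test.columns.get_indexer(s)  # column positions of the chosen _T columns
test['Highest'] = test.to_numpy()[np.arange(len(test)), col_idx]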

Excel Add date column with dates repeated 24 times [duplicate]

This question already has answers here:
Excel add column starting at 1 and increments to 24 then resets [closed]
(2 answers)
Closed 8 years ago.
Here is a sample of my data
Hour Index Visits
0 67
1 22
2 111
3 22
4 0
5 0
6 22
7 44
8 0
9 89
10 22
11 111
12 44
13 89
14 44
15 111
16 177
17 89
18 44
19 44
20 89
21 22
22 89
23 44
24 133
25 44
26 22
27 22
28 44
29 22
30 44
31 44
32 22
What I want to do is add two columns. One column holds the date, starting at Jan 1, 2013, with each date repeated for 24 rows before incrementing to the next day. The other column just displays the month of the date column. Here is what it should look like:
Hour Index Visits date month
0 67 1/1/2013 1
1 22 1/1/2013 1
2 111 1/1/2013 1
3 22 1/1/2013 1
4 0 1/1/2013 1
5 0 1/1/2013 1
6 22 1/1/2013 1
7 44 1/1/2013 1
8 0 1/1/2013 1
9 89 1/1/2013 1
10 22 1/1/2013 1
11 111 1/1/2013 1
12 44 1/1/2013 1
13 89 1/1/2013 1
14 44 1/1/2013 1
15 111 1/1/2013 1
16 177 1/1/2013 1
17 89 1/1/2013 1
18 44 1/1/2013 1
19 44 1/1/2013 1
20 89 1/1/2013 1
21 22 1/1/2013 1
22 89 1/1/2013 1
23 44 1/1/2013 1
24 133 2/1/2013 1
25 44 2/1/2013 1
26 22 2/1/2013 1
27 22 2/1/2013 1
28 44 2/1/2013 1
29 22 2/1/2013 1
30 44 2/1/2013 1
31 44 2/1/2013 1
32 22 2/1/2013 1
Suppose your Hour Index starts in A2. Then in the date column (column C) you can write:
=DATE(2013,1,1)+INT(A2/24)
and drag it down.
Next, in the month column (column D), write:
=MONTH(C2)
and drag it down.
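Equivalently, the month column can be computed directly from the Hour Index without referencing column C (same assumptions as above):
=MONTH(DATE(2013,1,1)+INT(A2/24))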
