Create New DataFrame Columns Based on Year - python-3.x

I have a pandas DataFrame that contains NFL quarterback data from the 2015-2016 to the 2019-2020 seasons. The DataFrame looks like this:
Player Season End Year YPG TD
Tom Brady 2019 322.6 25
Tom Brady 2018 308.1 26
Tom Brady 2017 295.7 24
Tom Brady 2016 308.7 28
Aaron Rodgers 2019 360.4 30
Aaron Rodgers 2018 358.8 33
Aaron Rodgers 2017 357.9 35
Aaron Rodgers 2016 355.2 32
I want to create new columns that contain the data for a year I select and for the three years before it. For example, if the year I select is 2019, the resulting DataFrame would be (SY stands for selected year):
Player Season End Year YPG SY YPG SY-1 YPG SY-2 YPG SY-3 TD
Tom Brady 2019 322.6 308.1 295.7 308.7 25
Aaron Rodgers 2019 360.4 358.8 357.9 355.2 30
This is how I am attempting to do it:
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']), 'YPG SY'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-1), 'YPG SY-1'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-2), 'YPG SY-2'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-3), 'YPG SY-3'] = NFL_Data['YPG']
However, when I run the code above, it doesn't fill out the columns appropriately. Most of the rows are 0. Am I approaching the problem the right way or is there a better way to attack it?
(Edited to include TD Column)

First step is to pivot your data frame.
pivoted = df.pivot_table(index='Player', columns='Season End Year', values='YPG')
Which yields
Season End Year 2016 2017 2018 2019
Player
Aaron Rodgers 355.2 357.9 358.8 360.4
Tom Brady 308.7 295.7 308.1 322.6
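Then you may select the years you need (this range covers year, year-1 and year-2; use year-4 as the stop value to include year-3 as well):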
pivoted.loc[:, range(year, year-3, -1)]
2019 2018 2017
Player
Aaron Rodgers 360.4 358.8 357.9
Tom Brady 322.6 308.1 295.7
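Or alternatively, as suggested by Quang, use a label slice. Note that .loc slices include both endpoints, so this one also returns the year-3 column: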
pivoted.loc[:, year:year-3:-1]
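To get all the way to the layout asked for in the question, the pivoted columns can be renamed and joined back onto the rows of the selected season. The following is only a sketch built on the sample data above; the names year, lookback, wide, selected and result are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Player": ["Tom Brady"] * 4 + ["Aaron Rodgers"] * 4,
    "Season End Year": [2019, 2018, 2017, 2016] * 2,
    "YPG": [322.6, 308.1, 295.7, 308.7, 360.4, 358.8, 357.9, 355.2],
    "TD": [25, 26, 24, 28, 30, 33, 35, 32],
})

year = 2019    # the selected year (SY); assumed name
lookback = 3   # how many earlier seasons to include; assumed name

# One row per player, one column per season.
pivoted = df.pivot_table(index="Player", columns="Season End Year", values="YPG")

# Pick SY, SY-1, ..., SY-lookback. reindex leaves NaN for seasons missing from the data.
years = [year - i for i in range(lookback + 1)]
wide = pivoted.reindex(columns=years)
wide.columns = ["YPG SY"] + [f"YPG SY-{i}" for i in range(1, lookback + 1)]

# Attach the wide columns to the rows of the selected season only.
selected = df.loc[df["Season End Year"] == year, ["Player", "Season End Year", "TD"]]
result = selected.merge(wide, left_on="Player", right_index=True)
print(result)
```

Using reindex rather than a plain .loc selection is a design choice here: it avoids a KeyError when one of the requested seasons is absent from the data.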

Related

How to extract and create new columns from specific match

I have a column bike_name and I want to know the easiest way to split it into year and CC.
CC should contain the number that comes directly before the letters cc. Where cc is not present, it should remain blank.
year should contain just the year, which is always the last word.
TVS Star City Plus Dual Tone 110cc 2018
Royal Enfield Classic 350cc 2017
Triumph Daytona 675R 2013
TVS Apache RTR 180cc 2017
Yamaha FZ S V 2.0 150cc-Ltd. Edition 2018
Yamaha FZs 150cc 2015
You can extract them separately: the year is the last 4 characters, and CC is found with a regex:
df["year"] = df.bike_name.str[-4:]
df["CC"] = df.bike_name.str.extract(r"(\d+)cc").fillna("")
where the regex looks for a sequence of digits followed literally by "cc". Rows with no match get NaN, so we fill those with an empty string,
to get
bike_name year CC
0 TVS Star City Plus Dual Tone 110cc 2018 2018 110
1 Royal Enfield Classic 350cc 2017 2017 350
2 Triumph Daytona 675R 2013 2013
3 TVS Apache RTR 180cc 2017 2017 180
4 Yamaha FZ S V 2.0 150cc-Ltd. Edition 2018 2018 150
5 Yamaha FZs 150cc 2015 2015 150
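As a variation, the year can also be pulled out with a regex anchored to the end of the string, which avoids picking up four arbitrary characters if a name does not end in a year. This is a sketch under the assumption that the year, when present, is always the final four digits; the sample rows are the ones from the question.

```python
import pandas as pd

df = pd.DataFrame({"bike_name": [
    "TVS Star City Plus Dual Tone 110cc 2018",
    "Royal Enfield Classic 350cc 2017",
    "Triumph Daytona 675R 2013",
    "TVS Apache RTR 180cc 2017",
    "Yamaha FZ S V 2.0 150cc-Ltd. Edition 2018",
    "Yamaha FZs 150cc 2015",
]})

# Year: four digits at the very end of the string (NaN if there are none).
df["year"] = df["bike_name"].str.extract(r"(\d{4})\s*$", expand=False)

# CC: digits immediately followed by "cc"; rows without a match become "".
df["CC"] = df["bike_name"].str.extract(r"(\d+)cc", expand=False).fillna("")

print(df[["year", "CC"]])
```

Both columns hold strings at this point; convert them with pd.to_numeric if numbers are needed.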
If not only extraction but also removal is needed:
df.bike_name = (df.bike_name.str[:-4]
.str.replace(r"\d+cc", "", regex=True)
.str.rstrip())
where the first line removes the year, the second line removes the cc part, and the last line strips any trailing whitespace,
to get
>>> df
bike_name year CC
0 TVS Star City Plus Dual Tone 2018 110
1 Royal Enfield Classic 2017 350
2 Triumph Daytona 675R 2013
3 TVS Apache RTR 2017 180
4 Yamaha FZ S V 2.0 -Ltd. Edition 2018 150
5 Yamaha FZs 2015 150

How to replace values between columns based on condition in pandas?

I have a dataframe of the form
ID Effective_Date Paid_Off_Time
xqd27070601 09 August 2016 10 July 2016
xqd21601070 09 September 2016 10 July 2016
xqd26010760 10 July 2016 09 November 2016
EDIT
Originally, the dates shown are of type string. They can come in several formats, such as 9/18/2016 16:56, 09 August 2016 or 9/18/2016. Should we consider converting them to timestamps for easier comparison?
What I want
If Effective_Date > Paid_Off_Time, replace the value of Effective_Date with Paid_Off_Time and the value of Paid_Off_Time with Effective_Date.
Basically, switch the values between the 2 columns, because the date was inserted in the wrong column.
I have thought about using np.where, but I am wondering whether there is a less verbose, cleaner solution:
# Build the corrected columns in a new DataFrame
testDf = pd.DataFrame(columns=['Effective_Date','Paid_Off_Time'])
# Effective_Date gets the earlier of the two dates
testDf['Effective_Date'] = np.where(myDataFrame.Effective_Date < myDataFrame.Paid_Off_Time, myDataFrame.Effective_Date, myDataFrame.Paid_Off_Time)
# Paid_Off_Time gets the later of the two dates
testDf['Paid_Off_Time'] = np.where(myDataFrame.Paid_Off_Time < myDataFrame.Effective_Date, myDataFrame.Effective_Date, myDataFrame.Paid_Off_Time)
# Copy the corrected columns back
myDataFrame['Effective_Date'] = testDf['Effective_Date']
myDataFrame['Paid_Off_Time'] = testDf['Paid_Off_Time']
Convert dates to datetime
df=df.assign(Effective_Date=pd.to_datetime(df['Effective_Date'], format='%d %B %Y'),Paid_Off_Time=pd.to_datetime(df['Paid_Off_Time'], format='%d %B %Y'))
Select as per condition
m=df.Effective_Date>df.Paid_Off_Time
Swap values if condition met
df.loc[m, ['Effective_Date','Paid_Off_Time']] = df.loc[m, ['Paid_Off_Time','Effective_Date']].values  # .values bypasses column alignment so the swap takes effect
print(df)
ID Effective_Date Paid_Off_Time
0 xqd27070601 2016-07-10 2016-08-09
1 xqd21601070 2016-07-10 2016-09-09
2 xqd26010760 2016-07-10 2016-11-09
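For reference, here is the same mask-and-swap idea as one self-contained snippet using the three sample rows from the question. It assumes every date uses the "09 August 2016" layout; the other layouts mentioned in the question's edit would need to be parsed separately first. The conversion to a plain array matters: if a DataFrame were assigned directly, pandas would align it on column labels and the values would end up back where they started.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["xqd27070601", "xqd21601070", "xqd26010760"],
    "Effective_Date": ["09 August 2016", "09 September 2016", "10 July 2016"],
    "Paid_Off_Time": ["10 July 2016", "10 July 2016", "09 November 2016"],
})

cols = ["Effective_Date", "Paid_Off_Time"]

# Parse both columns; the format string is an assumption about the data.
for col in cols:
    df[col] = pd.to_datetime(df[col], format="%d %B %Y")

# Rows where the two dates are in the wrong order.
m = df["Effective_Date"] > df["Paid_Off_Time"]

# Swap by assigning the reversed columns as a raw array (no label alignment).
df.loc[m, cols] = df.loc[m, cols[::-1]].to_numpy()

print(df)
```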
I am sharing a piece of my project code in which I did something similar. I hope this kind of implementation gives you the solution.
# Adjust format to match your data (e.g. '%d %B %Y' for '09 August 2016')
df['Effective_Date'] = pd.to_datetime(df['Effective_Date'], format='%d/%m/%Y')
df['Paid_Off_Time'] = pd.to_datetime(df['Paid_Off_Time'], format='%d/%m/%Y')
for i in df.index:
    if df.loc[i, 'Effective_Date'] > df.loc[i, 'Paid_Off_Time']:
        # Swap the two values in this row
        k = df.loc[i, 'Effective_Date']
        df.loc[i, 'Effective_Date'] = df.loc[i, 'Paid_Off_Time']
        df.loc[i, 'Paid_Off_Time'] = k
You can try sorting the values row-wise in numpy to improve performance:
myDataFrame['Effective_Date'] = pd.to_datetime(myDataFrame['Effective_Date'])
myDataFrame['Paid_Off_Time'] = pd.to_datetime(myDataFrame['Paid_Off_Time'])
c = ['Effective_Date','Paid_Off_Time']
# Sort each row so the earlier date lands in the first column
data = np.sort(myDataFrame[c].to_numpy(), axis=1)
myDataFrame[c] = pd.DataFrame(data, columns=c, index=myDataFrame.index)
print (myDataFrame)
ID Effective_Date Paid_Off_Time
0 xqd27070601 2016-07-10 2016-08-09
1 xqd21601070 2016-07-10 2016-09-09
2 xqd26010760 2016-07-10 2016-11-09

Copy row of data from one pandas dataframe to another

A pandas newbie here. I imported Excel data into pandas, and I want to copy a subset of the data in a specific row (placeholder) from one dataframe (Error_data1) to another dataframe (Error_data2) wherever the 'placeholder' exists.
Here are the first 4 rows of Error_data1 (it has 150 rows):
index student Error1 Error2 Error3 Error4 Error5
0 Henry 2.5647 -0.2145 1.3524 2.0124 6.2013
1 John -0.0124 1.0365 3.2145 4.0211 -5.0124
2 Terry 1.1120 2.2154 -6.2013 1.2032 2.3321
3 Gerald 9.2105 1.0212 3.2548 3.6478 4.1020
Here are the first 5 rows of Error_data2 (it has 358 rows):
index Day Time student Error1 Error2 Error3 Error4 Error5
0 Mon 01:00 Terry
1 Tue 05:15 John
2 Wed 05:25 john
3 Wed 12:15 Gerald
4 Thur 11:00 Henry
Here is the code I tried:
for i in range(len(Error_data1)):
    if Error_data1['Student'][i] == Error_data2['Student'][i]:
        a = Error_data1.iloc[i,1:6]
        Error_data2.iloc[i,4:9] = a
I expect Error_data2 to look like this:
index Day Time student Error1 Error2 Error3 Error4 Error5
0 Mon 01:00 Terry 1.1120 2.2154 -6.2013 1.2032 2.3321
1 Tue 05:15 John -0.0124 1.0365 3.2145 4.0211 -5.0124
2 Wed 05:25 john -0.0124 1.0365 3.2145 4.0211 -5.0124
3 Wed 12:15 Gerald 9.2105 1.0212 3.2548 3.6478 4.1020
4 Thur 11:00 Henry 2.5647 -0.2145 1.3524 2.0124 6.2013
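You can try merging the two dataframes on student names. Keep Error_data2 on the left so that all of its rows survive, and drop its empty Error columns first so the merge does not produce suffixed duplicates (Error1_x, Error1_y):
combined = Error_data2[['Day', 'Time', 'student']].merge(Error_data1, on='student', how='left')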
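One detail in the sample data is that "john" and "John" differ in case, so an exact-match merge would leave that row without error values. Below is a sketch of a case-insensitive variant; the helper column name key is an assumption introduced for illustration, and only Error1 and Error2 are included to keep the example short.

```python
import pandas as pd

# Abbreviated sample data from the question (Error3 to Error5 omitted).
Error_data1 = pd.DataFrame({
    "student": ["Henry", "John", "Terry", "Gerald"],
    "Error1": [2.5647, -0.0124, 1.1120, 9.2105],
    "Error2": [-0.2145, 1.0365, 2.2154, 1.0212],
})
Error_data2 = pd.DataFrame({
    "Day": ["Mon", "Tue", "Wed", "Wed", "Thur"],
    "Time": ["01:00", "05:15", "05:25", "12:15", "11:00"],
    "student": ["Terry", "John", "john", "Gerald", "Henry"],
})

# Hypothetical helper column: a lower-cased key so "john" matches "John".
left = Error_data2.assign(key=Error_data2["student"].str.strip().str.lower())
right = (Error_data1
         .assign(key=Error_data1["student"].str.strip().str.lower())
         .drop(columns="student"))

# A left merge keeps every row of Error_data2 in its original order.
combined = left.merge(right, on="key", how="left").drop(columns="key")
print(combined)
```

Students that appear in Error_data2 but not in Error_data1 end up with NaN in the error columns; append .fillna(0) if zeros are preferred.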

How take from string the words I need?

I have many strings like these.
Roliffe (Day) - Thursday, 15 June 2019
Tadcorp Pk Munangle (Day) - Tuesday, 10 July 2019
Gecester Park (Night) - Friday, 26 June 2019
I need to extract the names, for example Roliffe, Tadcorp Pk Munangle, Gecester Park,
and the dates: 15 June 2019, 10 July 2019, 26 June 2019.
How can I do it?
I would use regular expressions like this:
import re
string = """Roliffe (Day) - Thursday, 15 June 2019
Tadcorp Pk Munangle (Day) - Tuesday, 10 July 2019
Gecester Park (Night) - Friday, 26 June 2019"""
places = re.findall(r'([\w ]*) \(.*\)', string)
dates = re.findall(r'\d{2} \w* \d{4}', string)
print(', '.join(places))
print(', '.join(dates))
Output
Roliffe, Tadcorp Pk Munangle, Gecester Park
15 June 2019, 10 July 2019, 26 June 2019
If the data always follows the same pattern, plain string splitting also works, although it is not the most efficient approach (this snippet is JavaScript):
s = 'Roliffe (Day) - Thursday, 15 June 2019';
firstSplit = s.split('(');
name = firstSplit[0].trim();
date = firstSplit[1].split(',')[1].trim();
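The same splitting idea can be written in Python, which matches the regex answer above. This is a minimal sketch that assumes every string has the form "Name (Day) - Weekday, DD Month YYYY"; the function name split_entry is hypothetical.

```python
lines = [
    "Roliffe (Day) - Thursday, 15 June 2019",
    "Tadcorp Pk Munangle (Day) - Tuesday, 10 July 2019",
    "Gecester Park (Night) - Friday, 26 June 2019",
]

def split_entry(s):
    """Return (name, date) from 'Name (Day) - Weekday, DD Month YYYY'."""
    name, _, rest = s.partition("(")   # text before the first "("
    _, _, date = rest.partition(",")   # text after the first ","
    return name.strip(), date.strip()

pairs = [split_entry(s) for s in lines]
for name, date in pairs:
    print(name, "|", date)
```

Unlike indexing into the result of split, partition returns empty strings when the separator is missing, so a malformed line yields an empty date instead of raising an IndexError.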

Return value based on most recent "completed year"?

I have data that lists a Term Year ("A", "B", "C", ...) and some data.
A term year is a complete calendar year, that is, one that includes all 12 months.
I am trying to determine the most recent, complete, term year with a formula. (Not a UDF if possible).
Example data:
Term Month Year Misc. Data
A January 2017 32
A February 2017 35
A March 2017 448
A April 2017 747
A May 2017 656
A June 2017 370
A June 2017 1892
A July 2017 373
A August 2017 387
A August 2017 3
A August 2017 32992
A September 2017 815
A October 2017 479
A November 2017 753
A December 2017 413
B August 2018 544
B September 2018 541
B October 2018 435
B November 2018 17
B December 2018 270
B January 2018 309
B February 2018 488
(Edit: Added data, there will be multiple entries per month.)
So, since Term A is the most recent term as of today (2019) that has all 12 months, I am just looking to have the formula return A.
As for my current attempts, I can't think of how to build an Index/Match formula for this. I am "afraid" I'll need a UDF, or at least some type of helper column. So far I've gotten just =Index(A2:A20 and can't think of how to continue from there. I have a hunch Aggregate() may be needed, but I can't figure out how.
IF you only have a single entry per month, and IF the years are sorted ascending as you show, then try:
=LOOKUP(2,1/(COUNTIFS(Table1[Year],Table1[Year])=12),Table1[[#All],[Term]])
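The formula above relies on one entry per month. For comparison only, the underlying logic (count the distinct months per term, keep the terms that reach 12, take the latest) can be sketched in pandas, where repeated months are not a problem. This is not a worksheet formula, and the data below is an abbreviated reconstruction of the question's table with the Misc. Data column left out.

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Term A: all 12 months of 2017, plus repeated June and August entries.
rows = [("A", m, 2017) for m in months]
rows += [("A", "June", 2017), ("A", "August", 2017)]
# Term B: only 7 months of 2018.
rows += [("B", m, 2018) for m in ["August", "September", "October", "November",
                                  "December", "January", "February"]]
data = pd.DataFrame(rows, columns=["Term", "Month", "Year"])

# Distinct months per (Term, Year); duplicates within a month do not matter.
distinct = data.groupby(["Term", "Year"])["Month"].nunique().reset_index()
complete = distinct[distinct["Month"] == 12]

# Most recent complete term = the complete row with the largest Year.
latest = complete.sort_values("Year")["Term"].iloc[-1] if not complete.empty else None
print(latest)
```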
