How to extract and create new columns from specific match - python-3.x

I have a column bike_name and I want to know the easiest way to split it into year and CC.
CC should contain the number attached directly before the word cc. Where cc is not present, it should remain blank.
year should contain just the year, which is always the last word.
TVS Star City Plus Dual Tone 110cc 2018
Royal Enfield Classic 350cc 2017
Triumph Daytona 675R 2013
TVS Apache RTR 180cc 2017
Yamaha FZ S V 2.0 150cc-Ltd. Edition 2018
Yamaha FZs 150cc 2015

You can extract them separately: year is the last 4 characters, CC is via a regex:
df["year"] = df.bike_name.str[-4:]
df["CC"] = df.bike_name.str.extract(r"(\d+)cc").fillna("")
The regex looks for a sequence of digits followed literally by "cc". Where there is no match, extract returns NaN, so we fill those with an empty string,
to get
   bike_name                                  year  CC
0  TVS Star City Plus Dual Tone 110cc 2018    2018  110
1  Royal Enfield Classic 350cc 2017           2017  350
2  Triumph Daytona 675R 2013                  2013
3  TVS Apache RTR 180cc 2017                  2017  180
4  Yamaha FZ S V 2.0 150cc-Ltd. Edition 2018  2018  150
5  Yamaha FZs 150cc 2015                      2015  150
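For anyone who wants to reproduce this, here is a self-contained sketch on the sample rows. One deliberate difference from the snippet above: the year is pulled with a regex anchored at the end of the string instead of a fixed slice. That is only an assumption about what might be more forgiving if a value has trailing whitespace; `str[-4:]` works just as well on clean data.

```python
import pandas as pd

df = pd.DataFrame({"bike_name": [
    "TVS Star City Plus Dual Tone 110cc 2018",
    "Royal Enfield Classic 350cc 2017",
    "Triumph Daytona 675R 2013",
    "TVS Apache RTR 180cc 2017",
    "Yamaha FZ S V 2.0 150cc-Ltd. Edition 2018",
    "Yamaha FZs 150cc 2015",
]})

# Year: four digits at the very end of the string (NaN if there are none).
df["year"] = df["bike_name"].str.extract(r"(\d{4})\s*$", expand=False)

# CC: the digits immediately before a literal "cc"; blank when there is no match.
df["CC"] = df["bike_name"].str.extract(r"(\d+)cc", expand=False).fillna("")

print(df)
```

Both new columns hold strings here; convert with `pd.to_numeric` if numbers are needed.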
If you also need to remove the extracted parts from bike_name:
df.bike_name = (df.bike_name.str[:-4]
.str.replace(r"\d+cc", "", regex=True)
.str.rstrip())
The first line removes the year, the second removes the cc part, and the last right-strips each value in case a trailing space is unwanted,
to get
>>> df
   bike_name                        year  CC
0  TVS Star City Plus Dual Tone     2018  110
1  Royal Enfield Classic            2017  350
2  Triumph Daytona 675R             2013
3  TVS Apache RTR                   2017  180
4  Yamaha FZ S V 2.0 -Ltd. Edition  2018  150
5  Yamaha FZs                       2015  150
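A self-contained check of the removal step, assuming it is applied exactly once to the original names. Because `str[:-4]` cuts blindly, re-running the assignment on an already cleaned column would chop four more characters off every name, so it is worth running this in a fresh session. The variable names `names` and `cleaned` are just for illustration.

```python
import pandas as pd

names = pd.Series([
    "TVS Star City Plus Dual Tone 110cc 2018",
    "Royal Enfield Classic 350cc 2017",
    "Triumph Daytona 675R 2013",
    "TVS Apache RTR 180cc 2017",
    "Yamaha FZ S V 2.0 150cc-Ltd. Edition 2018",
    "Yamaha FZs 150cc 2015",
], name="bike_name")

cleaned = (
    names.str[:-4]                              # drop the 4-character year
         .str.replace(r"\d+cc", "", regex=True)  # drop the "<digits>cc" token
         .str.rstrip()                           # trim trailing spaces
)

print(cleaned.tolist())
```

Note that "675R" survives, since it is not followed by "cc".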

Related

SQL aggregating a text field based on latest date text value

I have the following script:
DROP TABLE IF EXISTS [dbo].[test]
CREATE TABLE [dbo].[test]
(
[Name] [varchar](50) NULL,
[Amount] [int] NULL
)
GO
INSERT INTO [dbo].[test]
(
[Name]
,[Amount]
)
VALUES
('Abc - 20 april to 7 june 2020',25)
,('Abc - 20 april to 7 june 2020',33)
,('Abc - 20 april to 29 june 2020',15)
,('Abc - 20 april to 29 june 2020',55)
,('Abc - 20 april to 10 may 2020',20)
,('Abc - 20 april to 10 may 2020',75)
,('Abc - 20 april to 10 may 2020',89)
GO
SELECT *
FROM [dbo].[test]
The resulting table gives the following results:
Name | Amount
-------------------------------|-------
Abc - 20 april to 7 june 2020 | 25
Abc - 20 april to 7 june 2020 | 33
Abc - 20 april to 29 june 2020 | 15
Abc - 20 april to 29 june 2020 | 55
Abc - 20 april to 10 may 2020 | 20
Abc - 20 april to 10 may 2020 | 75
Abc - 20 april to 10 may 2020 | 89
I would like to be able to determine the most recent end date in the text field and eliminate the repetitive data by grouping it showing the record with the latest end date and aggregating the amount. The result should be a single record with the following data:
Name | Amount
-------------------------------|-------
Abc - 20 april to 29 june 2020 | 312
I've played around with some group bys and text manipulation functions and this is as far as I got using the following code:
select (case when [Name] like '%-%'
then trim(left([Name], charindex('-', reverse([Name])) - 1))
else [Name]
end) as [Name]
,sum(Amount) as Amount
from [dbo].[test]
group by
(case when [Name] like '%-%'
then trim(left([Name], charindex('-', reverse([Name])) - 1))
else [Name]
end)
The code above does little more than aggregate identical Name values. I am looking for something smarter: it should recognize that all of these records really mean the same thing, find that the latest end date in the Name field is 29 june 2020, and display only that single record with the total aggregated Amount.
Any help would be much appreciated.
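To make the target concrete, the intended logic can be sketched in pandas (this is not T-SQL, just an illustration of the steps). It assumes every Name has the form "<key> - <start> to <end date>" with English month names and an English locale; the helper columns `key` and `end_date` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Abc - 20 april to 7 june 2020"] * 2
            + ["Abc - 20 april to 29 june 2020"] * 2
            + ["Abc - 20 april to 10 may 2020"] * 3,
    "Amount": [25, 33, 15, 55, 20, 75, 89],
})

# Assumption: the grouping key is the text before " - ",
# and the end date is the text after the last " to ".
df["key"] = df["Name"].str.split(" - ").str[0]
end_text = df["Name"].str.split(" to ").str[-1]
df["end_date"] = pd.to_datetime(end_text.str.title(), format="%d %B %Y")

# One row per key: the Name carrying the latest end date ...
latest = df.loc[df.groupby("key")["end_date"].idxmax(), ["key", "Name"]]
# ... combined with the total Amount over all rows of that key.
totals = df.groupby("key", as_index=False)["Amount"].sum()
result = latest.merge(totals, on="key")[["Name", "Amount"]]

print(result)
```

The same three steps (parse the end date, pick the latest Name per key, sum per key) would carry over to a SQL solution.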

Create New DataFrame Columns Based on Year

I have a pandas DataFrame that contains NFL Quarterback Data from the 2015-2016 to the 2019-2020 Seasons. The DataFrame looks like this
Player         Season End Year  YPG    TD
Tom Brady      2019             322.6  25
Tom Brady      2018             308.1  26
Tom Brady      2017             295.7  24
Tom Brady      2016             308.7  28
Aaron Rodgers  2019             360.4  30
Aaron Rodgers  2018             358.8  33
Aaron Rodgers  2017             357.9  35
Aaron Rodgers  2016             355.2  32
I want to be able to create new columns that contain the data for the year I select and for the three years before it. For example, if the year I select is 2019, the resulting DataFrame would be (SY stands for selected year):
Player         Season End Year  YPG SY  YPG SY-1  YPG SY-2  YPG SY-3  TD
Tom Brady      2019             322.6   308.1     295.7     308.7     25
Aaron Rodgers  2019             360.4   358.8     357.9     355.2     30
This is how I am attempting to do it:
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']), 'YPG SY'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-1), 'YPG SY-1'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-2), 'YPG SY-2'] = NFL_Data['YPG']
NFL_Data.loc[NFL_Data['Season End Year'] == (NFL_Data['SY']-3), 'YPG SY-3'] = NFL_Data['YPG']
However, when I run the code above, it doesn't fill out the columns appropriately. Most of the rows are 0. Am I approaching the problem the right way or is there a better way to attack it?
(Edited to include TD Column)
The first step is to pivot your DataFrame:
pivoted = df.pivot_table(index='Player', columns='Season End Year', values='YPG')
Which yields
Season End Year   2016   2017   2018   2019
Player
Aaron Rodgers    355.2  357.9  358.8  360.4
Tom Brady        308.7  295.7  308.1  322.6
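Then, with year = 2019, you may select the chosen year and the two before it (use year-4 as the stop value to include SY-3 as well):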
pivoted.loc[:, range(year, year-3, -1)]
Season End Year   2019   2018   2017
Player
Aaron Rodgers    360.4  358.8  357.9
Tom Brady        322.6  308.1  295.7
Or alternatively, as suggested by Quang (label slices include both endpoints, so this one also returns the year-3 column):
pivoted.loc[:, year:year-3:-1]
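Putting the pivot approach together for the layout asked for in the question (SY through SY-3 plus TD), a minimal sketch might look like this. The names `wide`, `td` and `result`, and the use of `reindex` and `join`, are my own choices for illustration and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Player": ["Tom Brady"] * 4 + ["Aaron Rodgers"] * 4,
    "Season End Year": [2019, 2018, 2017, 2016] * 2,
    "YPG": [322.6, 308.1, 295.7, 308.7, 360.4, 358.8, 357.9, 355.2],
    "TD": [25, 26, 24, 28, 30, 33, 35, 32],
})

year = 2019                               # the selected year (SY)
years = list(range(year, year - 4, -1))   # SY, SY-1, SY-2, SY-3

pivoted = df.pivot_table(index="Player", columns="Season End Year", values="YPG")

# reindex tolerates years that are missing from the data (they become NaN).
wide = pivoted.reindex(columns=years)
wide.columns = ["YPG SY"] + [f"YPG SY-{i}" for i in range(1, 4)]

# Attach the TD value of the selected year only.
td = df.loc[df["Season End Year"] == year].set_index("Player")["TD"]
result = wide.join(td).reset_index()

print(result)
```

`pivot_table` averages duplicates by default, so this assumes one row per player and season.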

Return value based on most recent "completed year"?

I have data that lists a Term Year ("A", "B", "C", ...) and some data.
A term year is a complete calendar year, i.e. one that includes all 12 months.
I am trying to determine the most recent, complete, term year with a formula. (Not a UDF if possible).
Example data:
Term  Month      Year  Misc. Data
A     January    2017  32
A     February   2017  35
A     March      2017  448
A     April      2017  747
A     May        2017  656
A     June       2017  370
A     June       2017  1892
A     July       2017  373
A     August     2017  387
A     August     2017  3
A     August     2017  32992
A     September  2017  815
A     October    2017  479
A     November   2017  753
A     December   2017  413
B     August     2018  544
B     September  2018  541
B     October    2018  435
B     November   2018  17
B     December   2018  270
B     January    2018  309
B     February   2018  488
(Edit: Added data, there will be multiple entries per month.)
So, since Term A is the most recent term (as of today, in 2019) that has all 12 months, I am just looking to have the formula return A.
As for my current attempts, I can't think of how to work an Index/Match formula. I am "afraid" I'll need a UDF, or at least some type of helper column. So far I've gotten just =Index(A2:A20 but can't think of how to build it from there. I have a hunch Aggregate() may be needed but I can't figure how.
IF you only have a single entry per month, and IF the years are sorted ascending as you show, then try:
=LOOKUP(2,1/(COUNTIFS(Table1[Year],Table1[Year])=12),Table1[[#All],[Term]])
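The formula above counts rows per year, which is why it needs a single entry per month. For comparison, here is the same idea sketched in pandas rather than as a worksheet formula, counting distinct months so that duplicate entries per month do not matter. The data is a shortened stand-in for the sample (the Misc. Data column is omitted), and all variable names are illustrative.

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

rows = [("A", m, 2017) for m in months]
rows += [("A", "June", 2017), ("A", "August", 2017)]        # repeated months
rows += [("B", m, 2018) for m in months[7:] + months[:2]]   # Aug-Dec, Jan-Feb
df = pd.DataFrame(rows, columns=["Term", "Month", "Year"])

# A term is complete when it covers 12 distinct months.
distinct_months = df.groupby(["Term", "Year"])["Month"].nunique().reset_index()
complete = distinct_months[distinct_months["Month"] == 12]

# Of the complete terms, take the one with the latest year.
latest_complete_term = complete.sort_values("Year")["Term"].iloc[-1]
print(latest_complete_term)  # A
```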

Sum IFs of total count without recounting Multiple instances, only the closest date prior to the AS OF DATE

I need a formula that will SUM the counts of, let's say, animal types as of a given AS OF DATE, WITHOUT adding in earlier counts for the same animal type: for each type, only the entry closest to (on or before) the AS OF DATE should count. Different animal types may be added or taken away, so the list is not fixed.
I would prefer not to do this in VBA or with a Pivot Table, but any help will be appreciated.
A       B            C
DATE    ANIMAL TYPE  COUNT
JAN 01  DOG          1
JAN 02  CAT          2
JAN 04  Fish         1
JAN 12  DOG          2
JAN 20  CAT          3
FEB 01  PIG          1
FEB 02  CAT          2

AS OF DATE  TOTAL ANIMALS
JAN 03      3
JAN 13      5
JAN 21      6
FEB 01      7
FEB 02      6
So:
As of Jan 03, there were 3 animals total: 1 Dog and 2 Cats.
As of Jan 13, there were 5 animals total: 2 Dogs, 1 Fish and 2 Cats, NOT 6.
As of Jan 21, there were 6 animals total: 2 Dogs, 1 Fish and 3 Cats, NOT 9.
As of Feb 01, there were 7 animals total: 2 Dogs, 1 Fish, 1 Pig and 3 Cats, NOT 10.
So far this is what I have. By using a helper column to filter the Animal Types I get a list without duplicates. Then I put that in a cell with Data Validation to pick the Type. Same for the Dates. However I would like to drop the Type input and just choose the Date. And be able to get a total.
Here is what works, but it is not what I need:
=SUMIFS(TabData1[Count],TabData1[Date],MAX(IF(TabData1[Animal Type]=$G$2,IF(TabData1[Date]<=$F$2,TabData1[Date]))))
I want to do away with the single-cell reference ($G$2) to a single Animal Type and replace it with a range, to get the latest count for all Animal Types as of a certain date. Something like this, but it does not work:
=SUMIFS(TabData1[Count],TabData1[Date],MAX(IF(TabData1[Animal Type]=(OFFSET($J$2,0,0,COUNT(IF(ListAnimalType="","",1)),1)),IF(TabData1[Date]<=$F$2,TabData1[Date]))))
To simplify (OFFSET($J$2,0,0,COUNT(IF(ListAnimalType="","",1)),1)) you can use $J$2:$J$5
=SUMIFS(TabData1[Count],TabData1[Date],MAX(IF(TabData1[Animal Type]=$J$2:$J$5,IF(TabData1[Date]<=$F$2,TabData1[Date]))))
And it looks like this
=SUMIFS(TabData1[Count],TabData1[Date],MAX(IF({"Dog";"Cat";"Fish";"Dog";"Cat";"Pig";"Cat";0;0;0;0;0;0;0;0;0}={"Cat";"Dog";"Fish";"Pig"},IF(TabData1[Date]<=$F$2,TabData1[Date]))))
Like I said, I want one formula that will take each Animal Type, find its latest date on or before the date in a specified cell, return the count for that Animal Type on that date, and then sum them all up.
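To pin down the expected numbers, here is the calculation sketched in pandas instead of as a worksheet formula. The dates have no year in the question, so 2020 is an assumption, and `total_as_of` is a hypothetical helper name.

```python
import pandas as pd

data = pd.DataFrame({
    # Assumption: the year is 2020; the question only gives day and month.
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-04",
                            "2020-01-12", "2020-01-20", "2020-02-01",
                            "2020-02-02"]),
    "Animal Type": ["DOG", "CAT", "Fish", "DOG", "CAT", "PIG", "CAT"],
    "Count": [1, 2, 1, 2, 3, 1, 2],
})

def total_as_of(data, as_of):
    """Sum, over animal types, the latest count on or before as_of."""
    upto = data[data["Date"] <= pd.Timestamp(as_of)]
    # Keep only the most recent row of each animal type.
    latest = upto.sort_values("Date", kind="stable").groupby("Animal Type").tail(1)
    return int(latest["Count"].sum())

for day in ["2020-01-03", "2020-01-13", "2020-01-21", "2020-02-01", "2020-02-02"]:
    print(day, total_as_of(data, day))
```

A formula-based solution has to do the same two things per animal type: find the maximum date not after the AS OF DATE, then pick up the count on that date.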

Merging and Adding Data in Excel Worksheets

I have 8 sheets of data (Dec 2014 to July 2015, one sheet per month). Each sheet contains that month's data in three columns, e.g. the Dec 2014 sheet has A/C #, Name and the Dec 2014 amount.
The Dec 2014 sheet contains the data below:
A/C #  Name  Dec 2014
A12    ABC   100
A13    CBA   200
A14    BCA   300
Whereas the January 2015 sheet contains the data below:
A/C #  Name  Jan 2015
A12    ABC   5
A13    CBA   300
*A15   IJK   900*
All sheets contain mostly the same accounts, plus any customers added in that month, with that month's amounts. E.g. January 2015 may contain an additional client's A/C #, name and January 2015 amount, as marked above.
I want a consolidated sheet of data where all data is arranged as below:
A/C #  Name  Dec 2014  Jan 2015  Feb 2015  Mar 2015  Apr 2015
A12    ABC   100       5
A13    CBA   200       300
A14    BCA   300       0
A15    IJK   0         900
I would suggest connecting to the worksheets using ADODB. Then you can issue an SQL statement that will merge the records together.
This could be run from a VBScript, or from Excel.
For a similar strategy, see here.
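The merge itself amounts to a full outer join on A/C # and Name. As an illustration of that join only, here is a sketch in pandas. This swaps in a different tool from the ADODB route suggested above, and the two in-memory frames stand in for the worksheets (in practice they could be loaded with `pandas.read_excel`).

```python
from functools import reduce

import pandas as pd

dec = pd.DataFrame({"A/C #": ["A12", "A13", "A14"],
                    "Name": ["ABC", "CBA", "BCA"],
                    "Dec 2014": [100, 200, 300]})
jan = pd.DataFrame({"A/C #": ["A12", "A13", "A15"],
                    "Name": ["ABC", "CBA", "IJK"],
                    "Jan 2015": [5, 300, 900]})

keys = ["A/C #", "Name"]
sheets = [dec, jan]  # extend with the remaining months

# Outer join keeps accounts that appear in only some of the months.
merged = reduce(lambda left, right: left.merge(right, on=keys, how="outer"), sheets)

# Months in which an account is absent become 0.
value_cols = [c for c in merged.columns if c not in keys]
merged[value_cols] = merged[value_cols].fillna(0).astype(int)
merged = merged.sort_values("A/C #").reset_index(drop=True)

print(merged)
```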
