How take from string the words I need? - python-3.x

I have many strings like these.
Roliffe (Day) - Thursday, 15 June 2019
Tadcorp Pk Munangle (Day) - Tuesday, 10 July 2019
Gecester Park (Night) - Friday, 26 June 2019
I need to take names for example Roliffe, Tadcorp Pk Munangle, Gecester Park
And dates 15 June 2019, 10 July 2019, 26 June 2019
How can I make it?

I would use regular expressions like this:
import re
string = """Roliffe (Day) - Thursday, 15 June 2019
Tadcorp Pk Munangle (Day) - Tuesday, 10 July 2019
Gecester Park (Night) - Friday, 26 June 2019"""
places = re.findall(r'([\w ]*) \(.*\)', string)
dates = re.findall(r'\d{2} \w* \d{4}', string)
print(', '.join(places))
print(', '.join(dates))
Output
Roliffe, Tadcorp Pk Munangle, Gecester Park
15 June 2019, 10 July 2019, 26 June 2019

If the data follows the same pattern.
This will not be an efficient one but will work.
s = 'Roliffe (Day) - Thursday, 15 June 2019';
firstSplit = s.split('(');
name = firstSplit[0].trim();
date = firstSplit[1].split(',')[1].trim();

Related

SQL aggregating a text field based on latest date text value

I have the following script:
DROP TABLE IF EXISTS [dbo].[test]
CREATE TABLE [dbo].[test]
(
[Name] [varchar](50) NULL,
[Amount] [int] NULL
)
GO
INSERT INTO [dbo].[test]
(
[Name]
,[Amount]
)
VALUES
('Abc - 20 april to 7 june 2020',25)
,('Abc - 20 april to 7 june 2020',33)
,('Abc - 20 april to 29 june 2020',15)
,('Abc - 20 april to 29 june 2020',55)
,('Abc - 20 april to 10 may 2020',20)
,('Abc - 20 april to 10 may 2020',75)
,('Abc - 20 april to 10 may 2020',89)
GO
SELECT *
FROM [dbo].[test]
The resulting table gives the following results:
Name | Amount
-------------------------------|-------
Abc - 20 april to 7 june 2020 | 25
Abc - 20 april to 7 june 2020 | 33
Abc - 20 april to 29 june 2020 | 15
Abc - 20 april to 29 june 2020 | 55
Abc - 20 april to 10 may 2020 | 20
Abc - 20 april to 10 may 2020 | 75
Abc - 20 april to 10 may 2020 | 89
I would like to be able to determine the most recent end date in the text field and eliminate the repetitive data by grouping it showing the record with the latest end date and aggregating the amount. The result should be a single record with the following data:
Name | Amount
-------------------------------|-------
Abc - 20 april to 29 june 2020 | 312
I've played around with some group bys and text manipulation functions and this is as far as I got using the following code:
select (case when [Name] like '%-%'
then trim(left([Name], charindex('-', reverse([Name])) - 1))
else [Name]
end) as [Name]
,sum(Amount) as Amount
from [dbo].[test]
group by
(case when [Name] like '%-%'
then trim(left([Name], charindex('-', reverse([Name])) - 1))
else [Name]
end)
The above code doesn't really do so much as to just aggregate unique records only. I was looking for something more intelligent that will find and recognize that all of my records really mean the same thing and that the latest end date in the Name field is 29 june 2020 and will only display that single record with the total aggregated Amount.
Any help would be much appreciated.

Splitting datetime value out of text string with uneven length

System: WIN10
IDE: MS Visual Studio COde
Language: Python version 3.7.3
Library: pandas version 1.0.1
Data source: supplied in the example below
Dataset: supplied in the example below
Ask:
I need to split the date and time string out of a column from a data frame that has rows of uneven delimiters i.e. some with three and some with four commas.
I am trying to figure out how to strip the date and time values: 'Nov 11 2013 12:00AM', and 'Apr 11 2013 12:00AM' respectively off the back of these two records in one column into a new column given the second row in the example below has fewer commas.
Code:
df['sample field'].head(2) 
4457-I need, this, date, Nov 11 2013 12:00AM ,
2359-I need this, date, Apr 11 2013 12:00AM , 
While the below method expands the data into different columns and staggers which column houses the date, this does not work. I need the date and time (or even just date) information in one column so that I can use the date values in further analysis (for example time-series).
Code:
df['sample field'].str.split(",", expand=True)
Data
df=pd.DataFrame({'Text':['4457-I need, this, date, Nov 11 2013 12:00AM ,','2359-I need this, date, Apr 11 2013 12:00AM ,']})
df
Use df.extract with a regex epression
df['Date']= df.Text.str.extract('([A-Za-z]+\s+\d+\s+\d+\s+\d+:[0-9A-Z]+(?=\s+\,+))')
df
#df.Date=pd.to_datetime(df.Date).dt.strftime('%b %d %Y %H:%M%p')
#df['date'] = pd.to_datetime(df['date'] ,format='%b %d %Y %H:%M%p')
df['Date']=pd.to_datetime(df['Date'])#This or even df['Date']=pd.to_datetime(df['Date'], format=('%b %d %Y %I:%M%p')) could work. Just remmeber because your time is 12AM use 12 clock hour system %I not %H and also hour 00.00 likely to be trncated, If have say11.00AM, the time will appear
IIUC you need str.extract with a regular expression.
Regex Demo Here
print(df)
0
0 4457-I need, this, date, Nov 11 2013 12:00AM
1 2359-I need this, date, Apr 11 2013 12:00AM
df['date'] = df[0].str.extract('(\w{3}\s\d.*\d{4}\s\d{2}:\d{2}\w{2})')
df['date'] = pd.to_datetime(df['date'] ,format='%b %d %Y %H:%M%p')
print(df)
0 date
0 4457-I need, this, date, Nov 11 2013 12:00AM 2013-11-11 12:00:00
1 2359-I need this, date, Apr 11 2013 12:00AM 2013-04-11 12:00:00
I'll use #wwnde's data :
df=pd.DataFrame({'Text':['4457-I need, this, date, Nov 11 2013 12:00AM ,','2359-I need this, date, Apr 11 2013 12:00AM ,']})
df['Date'] = df.Text.str.strip(',').str.split(',').str[-1].str.strip()
df['Date_formatted'] = pd.to_datetime(df.Date, format = '%b %d %Y %H:%M%p')
Text Date Date_formatted
0 4457-I need, this, date, Nov 11 2013 12:00AM , Nov 11 2013 12:00AM 2013-11-11 12:00:00
1 2359-I need this, date, Apr 11 2013 12:00AM , Apr 11 2013 12:00AM 2013-04-11 12:00:00

Extracting data out of Pandas df into a list

I have a Pandas dataframe with headers and rows that contain redundant data and wanted to extract from it. For example, I have a df that looks like this:
df = pd.DataFrame({'Your availability: Wednesday, December 25th, 2019 5:00AM-6:00AM': ['Wednesday, December 25th, 2019 5:00AM-6:00AM', pd.NaN, pd.NaN, 'Wednesday, December 25th, 2019 5:00AM-6:00AM'],
'Your availability: Tuesday, December 10th 2019 8:00AM-5:00PM': [pd.NaN, 'Tuesday, December 10th 2019 8:00AM-5:00PM', pd.NaN, pd.NaN]})
...and I want to extract the dates and put it into a dictionary for reference:
datetimes = {'P1': "Wednesday, December 25th, 2019 5:00AM-6:00AM", 'P2' : "Tuesday, December 10th 2019 8:00AM-5:00PM", 'P3': NaN, 'P4': "Wednesday, December 25th, 2019 5:00AM-6:00AM}
IIUC, try this
df.ffill(1).iloc[:,-1].rename(lambda x: f'P{x+1}').to_dict()
Out[1159]:
{'P1': 'Wednesday, December 25th, 2019 5:00AM-6:00AM',
'P2': 'Tuesday, December 10th 2019 8:00AM-5:00PM',
'P3': nan,
'P4': 'Wednesday, December 25th, 2019 5:00AM-6:00AM'}
Is it what you want:
df.drop_duplicates().stack().to_list()
Output:
['Wednesday, December 25th, 2019 5:00AM-6:00AM',
'Tuesday, December 10th 2019 8:00AM-5:00PM']

Return value based on most recent "completed year"?

I have data that lists a Term Year ("A", "B", "C", ...) and some data.
A term year is a complete calendar year from that includes all 12 months.
I am trying to determine the most recent, complete, term year with a formula. (Not a UDF if possible).
Example data:
Term Month Year Misc. Data
A January 2017 32
A February 2017 35
A March 2017 448
A April 2017 747
A May 2017 656
A June 2017 370
A June 2017 1892
A July 2017 373
A August 2017 387
A August 2017 3
A August 2017 32992
A September 2017 815
A October 2017 479
A November 2017 753
A December 2017 413
B August 2018 544
B September 2018 541
B October 2018 435
B November 2018 17
B December 2018 270
B January 2018 309
B February 2018 488
(Edit: Added data, there will be multiple entries per month.)
So, since Term A is the most recent from today (being 2019) that has all months , I am just looking to have the formula return A.
As for my current attempts, I can't think of how to work an Index/Match formula. I am "afraid" I'll need a UDF, or at least some type of helper column. So far I've gotten just =Index(A2:A20 but can't think of how to build it from there. I have a hunch Aggregate() may be needed but I can't figure how.
IF you only have a single entry per month, and IF the years are sorted ascending as you show, then try:
=LOOKUP(2,1/(COUNTIFS(Table1[Year],Table1[Year])=12),Table1[[#All],[Term]])

Excel:Next year period

How to get next year period based on current month and year, for example:
Jan 2014 - Dec 2014
Feb 2014 - Jan 2015
Mar 2014 - Feb 2015
Apr 2014 - Mar 2015
May 2014 - Apr 2015
Jun 2014 - May 2015
Jul 2014 - Jun 2015
Aug 2014 - Jul 2015
Sep 2014 - Aug 2015
Oct 2014 - Sep 2015
Nov 2014 - Oct 2015
Dec 2014 - Nov 2015
Next period
Jan 2015 - Dec 2015
Feb 2015 - Jan 2016
etc.
I have tried with the following formula:
=UPPER(TEXT(NOW();"MMM")) &" "& TEXT(NOW();"YY")-1
It works fine for Jan 2014 but can't figure out how to get Dec 2014; Feb 2014 - Jan 2015 and so on?
I think you need the EOMonth formula.
=EOMONTH(NOW(),-13) +1 and =EOMONTH(NOW(),-2) +1 should give give you JAN 2014 to DEC 2014
from the MS Excel documentation
Microsoft Excel stores dates as sequential serial numbers so they can
be used in calculations. By default, January 1, 1900 is serial number
1, and January 1, 2008 is serial number 39448 because it is 39,448
days after January 1, 1900.
To get the text formatting you are after, I would suggest that you stick with formatting the cell/column as #Makyen has suggested. Having said that this is the formula that you can use to format the text.
=UPPER(TEXT(EOMONTH(NOW(),-13) +1, "MMM YY"))
Assuming that the date (as a date serial number) for which you desire to find the year period is in cell A1, the following should provide the next year period starting from that day:
=EOMONTH(A1,11) +DAY(A1) -1
Examples:
Input Output
1/18/2014 1/17/2015
2/18/2014 2/17/2015
3/18/2014 3/17/2015
4/18/2014 4/17/2015
5/18/2014 5/17/2015
6/18/2014 6/17/2015
7/18/2014 7/17/2015
8/18/2014 8/17/2015
9/18/2014 9/17/2015
10/18/2014 10/17/2015
11/18/2014 11/17/2015
12/18/2014 12/17/2015
1/18/2015 1/17/2016
2/18/2015 2/17/2016
3/18/2015 3/17/2016
4/18/2015 4/17/2016
5/18/2015 5/17/2016
6/18/2015 6/17/2016
7/18/2015 7/17/2016
8/18/2015 8/17/2016
9/18/2015 9/17/2016
10/18/2015 10/17/2016
11/18/2015 11/17/2016
12/18/2015 12/17/2016
1/18/2016 1/17/2017
If you want the year period to start from the current day:
=EOMONTH(NOW(),11) + DAY(NOW()) -1
If you want the year period to start from the first day of the current month:
=EOMONTH(EOMONTH(NOW(),-1) + 1,11)
or
=EOMONTH(NOW() - DAY(NOW()) + 1,11)
The EOMONTH() function:
EOMONTH(start_date,months)
Returns the serial number for the last day of the month that is the
indicated number of months before or after start_date. Use EOMONTH to
calculate maturity dates or due dates that fall on the last day of the
month.
If this function is not available, and returns the #NAME? error,
install and load the Analysis ToolPak add-in.

Resources