I have a Pandas dataframe whose headers and rows contain redundant data, and I want to extract values from it. For example, I have a df that looks like this:
df = pd.DataFrame({'Your availability: Wednesday, December 25th, 2019 5:00AM-6:00AM': ['Wednesday, December 25th, 2019 5:00AM-6:00AM', np.nan, np.nan, 'Wednesday, December 25th, 2019 5:00AM-6:00AM'],
'Your availability: Tuesday, December 10th 2019 8:00AM-5:00PM': [np.nan, 'Tuesday, December 10th 2019 8:00AM-5:00PM', np.nan, np.nan]})
...and I want to extract the dates and put them into a dictionary for reference:
datetimes = {'P1': "Wednesday, December 25th, 2019 5:00AM-6:00AM", 'P2': "Tuesday, December 10th 2019 8:00AM-5:00PM", 'P3': np.nan, 'P4': "Wednesday, December 25th, 2019 5:00AM-6:00AM"}
IIUC, try this
df.ffill(axis=1).iloc[:,-1].rename(lambda x: f'P{x+1}').to_dict()
Out[1159]:
{'P1': 'Wednesday, December 25th, 2019 5:00AM-6:00AM',
'P2': 'Tuesday, December 10th 2019 8:00AM-5:00PM',
'P3': nan,
'P4': 'Wednesday, December 25th, 2019 5:00AM-6:00AM'}
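A step-by-step sketch of what the one-liner above does, assuming the frame from the question (with np.nan for the missing cells) and a default 0-based index. The names wed, tue, filled and last_col are only illustrative and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

wed = 'Wednesday, December 25th, 2019 5:00AM-6:00AM'
tue = 'Tuesday, December 10th 2019 8:00AM-5:00PM'
df = pd.DataFrame({
    'Your availability: ' + wed: [wed, np.nan, np.nan, wed],
    'Your availability: ' + tue: [np.nan, tue, np.nan, np.nan],
})

# Forward-fill along each row, so the last column ends up holding the
# right-most non-missing value of that row (or NaN if the row is empty).
filled = df.ffill(axis=1)
last_col = filled.iloc[:, -1]

# Relabel the 0-based index as P1, P2, ... and convert to a dict.
datetimes = last_col.rename(lambda i: f'P{i + 1}').to_dict()
print(datetimes)
```

If a row could hold two different values, this keeps the right-most one; whether that is the desired behaviour depends on the data.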
Is this what you want?
df.drop_duplicates().stack().to_list()
Output:
['Wednesday, December 25th, 2019 5:00AM-6:00AM',
'Tuesday, December 10th 2019 8:00AM-5:00PM']
I have a dataframe of the form
ID Effective_Date Paid_Off_Time
xqd27070601 09 August 2016 10 July 2016
xqd21601070 09 September 2016 10 July 2016
xqd26010760 10 July 2016 09 November 2016
EDIT
Originally, the dates shown are strings. Their format can be any of: 9/18/2016 16:56, 09 August 2016, 9/18/2016. Should I convert them to timestamps for easier comparison?
What I want
If Effective_Date > Paid_Off_Time, replace the value of Effective_Date with Paid_Off_Time and the value of Paid_Off_Time with Effective_Date.
Basically, switch the values between the two columns, because the dates were inserted in the wrong columns.
I have thought about using np.where, but I am wondering, isn't there a less verbose, cleaner solution?
#create a new dataFrame
testDf = pd.DataFrame(columns=['Effective_Date','Paid_Off_Time'])
#keep the earlier of the two dates as Effective_Date
testDf['Effective_Date'] = np.where(myDataFrame.Effective_Date < myDataFrame.Paid_Off_Time,myDataFrame.Effective_Date,myDataFrame.Paid_Off_Time)
#keep the later of the two dates as Paid_Off_Time
testDf['Paid_Off_Time'] = np.where(myDataFrame.Paid_Off_Time < myDataFrame.Effective_Date,myDataFrame.Effective_Date,myDataFrame.Paid_Off_Time)
myDataFrame['Effective_Date'] = testDf['Effective_Date']
myDataFrame['Paid_Off_Time'] = testDf['Paid_Off_Time']
Convert dates to datetime
df=df.assign(Effective_Date=pd.to_datetime(df['Effective_Date'], format='%d %B %Y'),Paid_Off_Time=pd.to_datetime(df['Paid_Off_Time'], format='%d %B %Y'))
Select as per condition
m=df.Effective_Date>df.Paid_Off_Time
Swap values if condition met
df.loc[m, ['Effective_Date','Paid_Off_Time']] = df.loc[m, ['Paid_Off_Time','Effective_Date']].values
print(df)
ID Effective_Date Paid_Off_Time
0 xqd27070601 2016-07-10 2016-08-09
1 xqd21601070 2016-07-10 2016-09-09
2 xqd26010760 2016-07-10 2016-11-09
I am sharing a piece of my project code in which I did something similar; I hope this kind of implementation helps. The format string is adjusted here to match the sample data in the question.
df['Effective_Date'] = pd.to_datetime(df['Effective_Date'], format='%d %B %Y')
df['Paid_Off_Time'] = pd.to_datetime(df['Paid_Off_Time'], format='%d %B %Y')
# assumes the default 0..n-1 index
for i in range(len(df)):
    if df.loc[i, 'Effective_Date'] > df.loc[i, 'Paid_Off_Time']:
        k = df.loc[i, 'Effective_Date']
        df.loc[i, 'Effective_Date'] = df.loc[i, 'Paid_Off_Time']
        df.loc[i, 'Paid_Off_Time'] = k
You can try sorting values in numpy to improve performance:
myDataFrame['Effective_Date'] = pd.to_datetime(myDataFrame['Effective_Date'])
myDataFrame['Paid_Off_Time'] = pd.to_datetime(myDataFrame['Paid_Off_Time'])
c = ['Effective_Date','Paid_Off_Time']
# sort each row so that the earlier date always comes first
data = np.sort(myDataFrame[c].to_numpy(), axis=1)
myDataFrame[c] = pd.DataFrame(data, columns=c, index=myDataFrame.index)
print (myDataFrame)
ID Effective_Date Paid_Off_Time
0 xqd27070601 2016-07-10 2016-08-09
1 xqd21601070 2016-07-10 2016-09-09
2 xqd26010760 2016-07-10 2016-11-09
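The EDIT in the question mentions that the strings come in several formats. A minimal sketch of one way to handle that before swapping, using only the standard library: try each candidate format in turn, then order each pair. The FORMATS tuple and the parse_date helper are assumptions made for illustration, the month/day order of 9/18/2016 is inferred from the example, and %B assumes English month names in the current locale.

```python
from datetime import datetime

# Candidate formats taken from the question's EDIT; extend as needed.
FORMATS = ('%m/%d/%Y %H:%M', '%d %B %Y', '%m/%d/%Y')

def parse_date(text):
    """Try each known format in turn; raise if none matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f'unrecognised date string: {text!r}')

rows = [
    ('xqd27070601', '09 August 2016', '10 July 2016'),
    ('xqd21601070', '09 September 2016', '10 July 2016'),
    ('xqd26010760', '10 July 2016', '09 November 2016'),
]

fixed = []
for row_id, effective, paid_off in rows:
    # sorted() puts the earlier date first, which swaps a reversed pair
    # and leaves a correctly ordered pair untouched.
    first, second = sorted((parse_date(effective), parse_date(paid_off)))
    fixed.append((row_id, first.date(), second.date()))

for row in fixed:
    print(row)
```

The same parse_date function could be applied to a dataframe column with Series.map before using any of the answers above.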
System: WIN10
IDE: MS Visual Studio Code
Language: Python version 3.7.3
Library: pandas version 1.0.1
Data source: supplied in the example below
Dataset: supplied in the example below
Ask:
I need to split the date and time string out of a column from a data frame that has rows of uneven delimiters i.e. some with three and some with four commas.
I am trying to figure out how to strip the date and time values: 'Nov 11 2013 12:00AM', and 'Apr 11 2013 12:00AM' respectively off the back of these two records in one column into a new column given the second row in the example below has fewer commas.
Code:
df['sample field'].head(2)
4457-I need, this, date, Nov 11 2013 12:00AM ,
2359-I need this, date, Apr 11 2013 12:00AM ,
The method below expands the data into separate columns, but the column that ends up holding the date varies from row to row, so it does not work for me. I need the date and time (or even just the date) in one column so that I can use the values in further analysis (for example, time series).
Code:
df['sample field'].str.split(",", expand=True)
Data
df=pd.DataFrame({'Text':['4457-I need, this, date, Nov 11 2013 12:00AM ,','2359-I need this, date, Apr 11 2013 12:00AM ,']})
df
Use str.extract with a regular expression
df['Date'] = df.Text.str.extract(r'([A-Za-z]+\s+\d+\s+\d+\s+\d+:[0-9A-Z]+(?=\s+\,+))')
df
# Either of the following converts the extracted strings to datetimes:
df['Date'] = pd.to_datetime(df['Date'])
# df['Date'] = pd.to_datetime(df['Date'], format='%b %d %Y %I:%M%p')
# Remember that the times use a 12-hour clock, so use %I rather than %H. 12:00AM is midnight (00:00), so the time part is likely to be truncated in the display; a time such as 11:00AM will appear.
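The difference between %I and %H can be checked with the standard library alone. In Python's strptime, the %p directive only affects the parsed hour when the hour itself is read with %I. A small sketch (the variable names are illustrative):

```python
from datetime import datetime

text = 'Nov 11 2013 12:00AM'

# %I reads a 12-hour clock, so %p is applied and 12:00AM becomes hour 0.
with_12h = datetime.strptime(text, '%b %d %Y %I:%M%p')

# %H reads a 24-hour clock; %p is consumed but does not change the hour.
with_24h = datetime.strptime(text, '%b %d %Y %H:%M%p')

print(with_12h)  # 2013-11-11 00:00:00
print(with_24h)  # 2013-11-11 12:00:00
```

So a format containing %H:%M%p silently turns midnight into noon for this data.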
IIUC you need str.extract with a regular expression.
print(df)
0
0 4457-I need, this, date, Nov 11 2013 12:00AM
1 2359-I need this, date, Apr 11 2013 12:00AM
df['date'] = df[0].str.extract(r'(\w{3}\s\d.*\d{4}\s\d{2}:\d{2}\w{2})')
# %I (12-hour clock) is needed for %p to take effect; with %H, 12:00AM would be read as noon
df['date'] = pd.to_datetime(df['date'], format='%b %d %Y %I:%M%p')
print(df)
0 date
0 4457-I need, this, date, Nov 11 2013 12:00AM 2013-11-11
1 2359-I need this, date, Apr 11 2013 12:00AM 2013-04-11
I'll use @wwnde's data:
df=pd.DataFrame({'Text':['4457-I need, this, date, Nov 11 2013 12:00AM ,','2359-I need this, date, Apr 11 2013 12:00AM ,']})
df['Date'] = df.Text.str.strip(',').str.split(',').str[-1].str.strip()
df['Date_formatted'] = pd.to_datetime(df.Date, format='%b %d %Y %I:%M%p')  # %I, not %H, so that AM/PM is honoured
Text Date Date_formatted
0 4457-I need, this, date, Nov 11 2013 12:00AM , Nov 11 2013 12:00AM 2013-11-11
1 2359-I need this, date, Apr 11 2013 12:00AM , Apr 11 2013 12:00AM 2013-04-11
I have many strings like these.
Roliffe (Day) - Thursday, 15 June 2019
Tadcorp Pk Munangle (Day) - Tuesday, 10 July 2019
Gecester Park (Night) - Friday, 26 June 2019
I need to extract the names, for example Roliffe, Tadcorp Pk Munangle, Gecester Park,
and the dates: 15 June 2019, 10 July 2019, 26 June 2019.
How can I do this?
I would use regular expressions like this:
import re
string = """Roliffe (Day) - Thursday, 15 June 2019
Tadcorp Pk Munangle (Day) - Tuesday, 10 July 2019
Gecester Park (Night) - Friday, 26 June 2019"""
places = re.findall(r'([\w ]*) \(.*\)', string)
dates = re.findall(r'\d{2} \w* \d{4}', string)
print(', '.join(places))
print(', '.join(dates))
Output
Roliffe, Tadcorp Pk Munangle, Gecester Park
15 June 2019, 10 July 2019, 26 June 2019
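If the names and dates need to stay paired per line, and the dates are wanted as date objects rather than text, one option is a single pattern with named groups. This is a sketch under the assumption that every line follows the "name (shift) - weekday, date" layout shown in the question; PATTERN and records are illustrative names, and %B assumes English month names.

```python
import re
from datetime import datetime

lines = [
    'Roliffe (Day) - Thursday, 15 June 2019',
    'Tadcorp Pk Munangle (Day) - Tuesday, 10 July 2019',
    'Gecester Park (Night) - Friday, 26 June 2019',
]

# name, then "(Day)" or "(Night)", then "- Weekday, ", then the date.
PATTERN = re.compile(
    r'^(?P<name>.+?)\s*\((?P<shift>[^)]*)\)\s*-\s*\w+,\s*'
    r'(?P<date>\d{1,2} \w+ \d{4})$'
)

records = []
for line in lines:
    match = PATTERN.match(line)
    if match is None:
        continue  # skip lines that do not follow the pattern
    parsed = datetime.strptime(match['date'], '%d %B %Y').date()
    records.append((match['name'], parsed))

for name, parsed in records:
    print(name, parsed.isoformat())
```

The weekday is matched but not validated against the date.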
If the data always follows the same pattern, plain string splitting also works. It is not the most efficient approach, but it does the job.
s = 'Roliffe (Day) - Thursday, 15 June 2019';
firstSplit = s.split('(');
name = firstSplit[0].trim();
date = firstSplit[1].split(',')[1].trim();
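The snippet above uses JavaScript-style trim(). A rough Python equivalent of the same split-based idea, assuming the same fixed layout (the name date_text is illustrative):

```python
s = 'Roliffe (Day) - Thursday, 15 June 2019'

# Everything before the first "(" is the name.
name, _, rest = s.partition('(')
name = name.strip()

# Everything after the first "," is the date.
date_text = rest.partition(',')[2].strip()

print(name)       # Roliffe
print(date_text)  # 15 June 2019
```

Like the original, this breaks if a name itself contains a parenthesis or a comma.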
I have data that lists a Term Year ("A", "B", "C", ...) and some data.
A term year is a complete calendar year, i.e. one that includes all 12 months.
I am trying to determine the most recent, complete, term year with a formula. (Not a UDF if possible).
Example data:
Term Month Year Misc. Data
A January 2017 32
A February 2017 35
A March 2017 448
A April 2017 747
A May 2017 656
A June 2017 370
A June 2017 1892
A July 2017 373
A August 2017 387
A August 2017 3
A August 2017 32992
A September 2017 815
A October 2017 479
A November 2017 753
A December 2017 413
B August 2018 544
B September 2018 541
B October 2018 435
B November 2018 17
B December 2018 270
B January 2018 309
B February 2018 488
(Edit: Added data, there will be multiple entries per month.)
So, since Term A is the most recent term (as of today, in 2019) that has all 12 months, I just want the formula to return A.
As for my current attempts, I can't think of how to work an Index/Match formula. I am "afraid" I'll need a UDF, or at least some type of helper column. So far I've gotten just =Index(A2:A20 but can't think of how to build it from there. I have a hunch Aggregate() may be needed but I can't figure how.
IF you only have a single entry per month, and IF the years are sorted ascending as you show, then try:
=LOOKUP(2,1/(COUNTIFS(Table1[Year],Table1[Year])=12),Table1[[#All],[Term]])
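The formula above depends on there being exactly one row per month. For checking results, the underlying logic can be sketched outside the worksheet in Python: collect the distinct months seen for each term, keep the terms with all 12, and pick the one with the latest year. This is only an illustration of the logic, not a worksheet formula; the rows are abbreviated from the example data, and it assumes each term belongs to a single year.

```python
from collections import defaultdict

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

# (term, month, year) rows, abbreviated from the example data.
rows = [('A', m, 2017) for m in MONTHS]
rows += [('A', 'June', 2017), ('A', 'August', 2017)]  # repeated months
rows += [('B', m, 2018) for m in ['August', 'September', 'October',
                                  'November', 'December', 'January',
                                  'February']]

months_seen = defaultdict(set)  # term -> distinct months present
term_year = {}                  # term -> its year
for term, month, year in rows:
    months_seen[term].add(month)
    term_year[term] = year

# A term is complete when all 12 distinct months are present.
complete = [t for t, seen in months_seen.items() if len(seen) == 12]
latest_complete = max(complete, key=term_year.get) if complete else None
print(latest_complete)  # A
```

Because months are stored in a set, repeated entries for the same month do not affect the count.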
How to get next year period based on current month and year, for example:
Jan 2014 - Dec 2014
Feb 2014 - Jan 2015
Mar 2014 - Feb 2015
Apr 2014 - Mar 2015
May 2014 - Apr 2015
Jun 2014 - May 2015
Jul 2014 - Jun 2015
Aug 2014 - Jul 2015
Sep 2014 - Aug 2015
Oct 2014 - Sep 2015
Nov 2014 - Oct 2015
Dec 2014 - Nov 2015
Next period
Jan 2015 - Dec 2015
Feb 2015 - Jan 2016
etc.
I have tried with the following formula:
=UPPER(TEXT(NOW();"MMM")) &" "& TEXT(NOW();"YY")-1
It works fine for Jan 2014 but can't figure out how to get Dec 2014; Feb 2014 - Jan 2015 and so on?
I think you need the EOMonth formula.
=EOMONTH(NOW(),-13) +1 and =EOMONTH(NOW(),-2) +1 should give you JAN 2014 to DEC 2014
from the MS Excel documentation
Microsoft Excel stores dates as sequential serial numbers so they can
be used in calculations. By default, January 1, 1900 is serial number
1, and January 1, 2008 is serial number 39448 because it is 39,448
days after January 1, 1900.
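The serial-number idea can be sketched in Python for dates from 1 March 1900 onward. The 1900 date system also counts a 29 February 1900 that never existed, which is why the usual conversion uses 30 December 1899 as day zero. EXCEL_EPOCH, from_serial and to_serial are names made up for this example.

```python
from datetime import date, timedelta

# Day zero that makes serial numbers from 1 March 1900 onward line up,
# given that the 1900 date system counts a non-existent 29 Feb 1900.
EXCEL_EPOCH = date(1899, 12, 30)

def from_serial(serial):
    """Convert a whole-number serial to a calendar date."""
    return EXCEL_EPOCH + timedelta(days=serial)

def to_serial(day):
    """Convert a calendar date to its serial number."""
    return (day - EXCEL_EPOCH).days

print(from_serial(39448))           # 2008-01-01
print(to_serial(date(2008, 1, 1)))  # 39448
```

Because dates are plain numbers, adding 1 to an EOMONTH result moves to the first day of the following month.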
To get the text formatting you are after, I would suggest that you stick with formatting the cell/column as @Makyen has suggested. Having said that, this is the formula you can use to format the text:
=UPPER(TEXT(EOMONTH(NOW(),-13) +1, "MMM YY"))
Assuming that the date (as a date serial number) for which you desire to find the year period is in cell A1, the following should provide the next year period starting from that day:
=EOMONTH(A1,11) +DAY(A1) -1
Examples:
Input Output
1/18/2014 1/17/2015
2/18/2014 2/17/2015
3/18/2014 3/17/2015
4/18/2014 4/17/2015
5/18/2014 5/17/2015
6/18/2014 6/17/2015
7/18/2014 7/17/2015
8/18/2014 8/17/2015
9/18/2014 9/17/2015
10/18/2014 10/17/2015
11/18/2014 11/17/2015
12/18/2014 12/17/2015
1/18/2015 1/17/2016
2/18/2015 2/17/2016
3/18/2015 3/17/2016
4/18/2015 4/17/2016
5/18/2015 5/17/2016
6/18/2015 6/17/2016
7/18/2015 7/17/2016
8/18/2015 8/17/2016
9/18/2015 9/17/2016
10/18/2015 10/17/2016
11/18/2015 11/17/2016
12/18/2015 12/17/2016
1/18/2016 1/17/2017
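The arithmetic behind the table above can be reproduced in Python to check individual rows. The eomonth function below is a hand-written stand-in for the worksheet function (an assumption for illustration, not a library call), and period_end mirrors =EOMONTH(A1,11) + DAY(A1) - 1.

```python
import calendar
from datetime import date, timedelta

def eomonth(start, months):
    """Last day of the month that is `months` months after `start`."""
    index = start.year * 12 + (start.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    return date(year, month, calendar.monthrange(year, month)[1])

def period_end(start):
    # Mirrors =EOMONTH(A1,11) + DAY(A1) - 1
    return eomonth(start, 11) + timedelta(days=start.day - 1)

print(period_end(date(2014, 1, 18)))   # 2015-01-17
print(period_end(date(2014, 12, 18)))  # 2015-12-17
```

Going to the end of the month 11 months ahead and then adding DAY - 1 days lands on the day before the same day-of-month one year later.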
If you want the year period to start from the current day:
=EOMONTH(NOW(),11) + DAY(NOW()) -1
If you want the year period to start from the first day of the current month:
=EOMONTH(EOMONTH(NOW(),-1) + 1,11)
or
=EOMONTH(NOW() - DAY(NOW()) + 1,11)
The EOMONTH() function:
EOMONTH(start_date,months)
Returns the serial number for the last day of the month that is the
indicated number of months before or after start_date. Use EOMONTH to
calculate maturity dates or due dates that fall on the last day of the
month.
If this function is not available, and returns the #NAME? error,
install and load the Analysis ToolPak add-in.