System: WIN10
IDE: MS Visual Studio Code
Language: Python version 3.7.3
Library: pandas version 1.0.1
Data source: supplied in the example below
Dataset: supplied in the example below
Ask:
I need to split the date and time string out of a column from a data frame that has rows of uneven delimiters i.e. some with three and some with four commas.
I am trying to figure out how to strip the date and time values: 'Nov 11 2013 12:00AM', and 'Apr 11 2013 12:00AM' respectively off the back of these two records in one column into a new column given the second row in the example below has fewer commas.
Code:
df['sample field'].head(2)
4457-I need, this, date, Nov 11 2013 12:00AM ,
2359-I need this, date, Apr 11 2013 12:00AM ,
The method below expands the data into separate columns, but the column that holds the date differs from row to row, so it does not work for me. I need the date and time (or even just the date) in one column so that I can use the date values in further analysis (for example, time series).
Code:
df['sample field'].str.split(",", expand=True)
Data
df=pd.DataFrame({'Text':['4457-I need, this, date, Nov 11 2013 12:00AM ,','2359-I need this, date, Apr 11 2013 12:00AM ,']})
df
Use Series.str.extract with a regex expression
df['Date'] = df.Text.str.extract(r'([A-Za-z]+\s+\d+\s+\d+\s+\d+:[0-9A-Z]+(?=\s+\,+))')
df
# Optional: render the parsed dates back to text
# df.Date = pd.to_datetime(df.Date).dt.strftime('%b %d %Y %I:%M%p')
# Or parse with an explicit format:
# df['Date'] = pd.to_datetime(df['Date'], format='%b %d %Y %I:%M%p')
df['Date'] = pd.to_datetime(df['Date'])
# Because the time is on a 12-hour clock (12:00AM), use %I rather than %H in the format.
# A midnight time (00:00) is likely to be truncated in the display; a value such as 11:00AM will show its time.
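The point about %I versus %H can be checked directly. The following is a minimal sketch, assuming the two sample rows from the question and a slightly stricter pattern than the one above (the name pattern is only illustrative). It parses the extracted text with %I and, for contrast, shows what the standard library does with %H.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({"Text": [
    "4457-I need, this, date, Nov 11 2013 12:00AM ,",
    "2359-I need this, date, Apr 11 2013 12:00AM ,",
]})

# Month abbreviation, day, year, then a 12-hour time with an AM/PM marker.
pattern = r"([A-Za-z]{3}\s+\d{1,2}\s+\d{4}\s+\d{1,2}:\d{2}[AP]M)"
df["Date"] = pd.to_datetime(
    df["Text"].str.extract(pattern, expand=False),
    format="%b %d %Y %I:%M%p",
)

# For contrast: with %H the AM/PM marker does not change the hour,
# so 12:00AM is read as noon instead of midnight.
wrong = datetime.strptime("Nov 11 2013 12:00AM", "%b %d %Y %H:%M%p")

print(df["Date"].dt.hour.tolist(), wrong.hour)
```

With %I both rows parse to midnight, whereas the %H variant reports hour 12.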
IIUC you need str.extract with a regular expression.
print(df)
0
0 4457-I need, this, date, Nov 11 2013 12:00AM
1 2359-I need this, date, Apr 11 2013 12:00AM
df['date'] = df[0].str.extract(r'(\w{3}\s\d.*\d{4}\s\d{2}:\d{2}\w{2})')
df['date'] = pd.to_datetime(df['date'], format='%b %d %Y %I:%M%p')  # %I, not %H, so that 12:00AM parses as midnight
print(df)
0 date
0 4457-I need, this, date, Nov 11 2013 12:00AM 2013-11-11
1 2359-I need this, date, Apr 11 2013 12:00AM 2013-04-11
I'll use @wwnde's data:
df=pd.DataFrame({'Text':['4457-I need, this, date, Nov 11 2013 12:00AM ,','2359-I need this, date, Apr 11 2013 12:00AM ,']})
df['Date'] = df.Text.str.strip(',').str.split(',').str[-1].str.strip()
df['Date_formatted'] = pd.to_datetime(df.Date, format = '%b %d %Y %I:%M%p')  # %I, not %H, so that 12:00AM parses as midnight
Text Date Date_formatted
0 4457-I need, this, date, Nov 11 2013 12:00AM , Nov 11 2013 12:00AM 2013-11-11
1 2359-I need this, date, Apr 11 2013 12:00AM , Apr 11 2013 12:00AM 2013-04-11
Related
Incident number   Received date   Closed Date   Time taken to close
111               01 Jan 2021     01 Feb 2021   31
222               01 Jan 2021     07 Feb 2021   37
333               01 Jan 2021
444               01 Jan 2021
I want to calculate the average number of days incidents have been open at a point in time. Using the example above, at the end of Feb 2021 you would apply these rules:
Received date has to be less than the metric date (the metric date in this case being Feb 2021)
Closed date has to be either greater than the metric date or empty (if the closed date is empty, the time taken to close is calculated from the received date to the metric date)
Using the example above, the first two incidents would not be included, but the last two would be. The difference between 01 Jan 2021 and 28 Feb 2021 is 58 days for each of them, so the average is (58*2) / 2 = 58, where 2 is the number of incidents included in the calculation. Using the same example, the calculation for Jan 2021 would be 31 days for each incident, as no incident was closed by 31 Jan, so it is (31*4) / 4. I would be repeating this for every month from Jan to Dec in 2020 and 2021.
Because an unclosed incident is encoded with a missing closed date, you will need a CASE or IF statement to compute the days-open metric correctly on a given asof date.
Example:
The days open average is computed for a variety of asof dates stored in a data set.
data have;
  call streaminit(2022);
  do id = 1 to 10;
    opened = '01jan2021'd + rand('integer', 60);
    closed = opened + rand('integer', 90);
    if rand('uniform') < 0.25 then call missing(closed);
    output;
  end;
  format opened closed yymmdd10.;
run;
data asof;
  do asof = '01jan2021'd to '01jun2021'd-1;
    output;
  end;
  format asof yymmdd10.;
run;
proc sql;
  create table averageDaysOpen_asof
  as
  select
    asof
  , mean (days_open) as days_open_avg format=6.2
  , count(days_open) as id_count
  from
  ( select asof
         , opened
         , closed
         , case
             when closed is not null and asof between opened and closed then asof-opened
             when closed is null and asof > opened then asof-opened
             else .
           end as days_open
    from asof
    cross join have
  )
  group by asof
  ;
quit;
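For readers working in pandas rather than SAS, the same as-of idea can be sketched as follows. This is only an illustration under assumptions: the column names and the function average_days_open are hypothetical, the data are the four incidents from the question, and the selection follows the question's two rules (received before the metric date, and closed after it or not closed at all).

```python
import pandas as pd

# The four incidents from the question; a missing closed date means still open.
incidents = pd.DataFrame({
    "incident": [111, 222, 333, 444],
    "received": pd.to_datetime(["2021-01-01"] * 4),
    "closed": pd.to_datetime(["2021-02-01", "2021-02-07", None, None]),
})


def average_days_open(frame, metric_date):
    """Mean days open as of metric_date, over incidents still open on that date."""
    metric_date = pd.Timestamp(metric_date)
    received_before = frame["received"] < metric_date
    still_open = frame["closed"].isna() | (frame["closed"] > metric_date)
    selected = frame[received_before & still_open]
    # Days from the received date to the metric date for each selected incident.
    return (metric_date - selected["received"]).dt.days.mean()


feb_average = average_days_open(incidents, "2021-02-28")
jan_average = average_days_open(incidents, "2021-01-31")
print(feb_average, jan_average)
```

For the end of February only incidents 333 and 444 qualify, giving 58 days. Note that a plain date subtraction counts 01 Jan to 31 Jan as 30 days, so whether both end dates should be counted is a convention to settle separately.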
I have an Edition column whose data follows an uneven pattern: in some rows the date follows ',' and in others it follows ',–'.
df.head()
17 Paperback,– 1 Nov 2016
18 Mass Market Paperback,– 1 Jan 1991
19 Paperback,– 2016
20 Hardcover,– 24 Nov 2018
21 Paperback,– Import, 4 Oct 2018
How can I extract the date into a separate column? I tried using str.split() but can't find a specific pattern to split on. Is there a method that would work?
obj = df['Edition']
obj.str.split('((?:\d+\s+\w+\s+)?\d{4}$)', expand=True)
or
obj.str.split('[,–]+').str[0]
obj.str.split('[,–]+').str[-1] # date
Try using dateutil
from dateutil.parser import parse
df['Dt']=[parse(i, fuzzy_with_tokens=True)[0] for i in df['column']]
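As a variation on the split-based answer above, the date can also be pulled out directly with str.extract. A minimal sketch, assuming the five sample rows from the question and that the date (or bare year) always sits at the end of the string; the name pattern is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Edition": [
    "Paperback,– 1 Nov 2016",
    "Mass Market Paperback,– 1 Jan 1991",
    "Paperback,– 2016",
    "Hardcover,– 24 Nov 2018",
    "Paperback,– Import, 4 Oct 2018",
]})

# An optional "day month" prefix followed by a four-digit year at the end of the string.
pattern = r"((?:\d{1,2}\s+[A-Za-z]{3}\s+)?\d{4})\s*$"
df["Date"] = df["Edition"].str.extract(pattern, expand=False)

dates = df["Date"].tolist()
print(dates)
```

Rows that hold only a year keep just the year, so a later conversion to datetime needs to decide how to treat them.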
I have a dataframe of the form
ID Effective_Date Paid_Off_Time
xqd27070601 09 August 2016 10 July 2016
xqd21601070 09 September 2016 10 July 2016
xqd26010760 10 July 2016 09 November 2016
EDIT
The dates shown are originally strings. They can appear in several formats: 9/18/2016 16:56, 09 August 2016, or 9/18/2016. Should we consider converting them to timestamps for easier comparison?
What I want
If Effective_Date > Paid_Off_Time, replace the value of Effective_Date with Paid_Off_Time and the value of Paid_Off_Time with Effective_Date.
Basically, switch the values between the two columns, because the date was inserted in the wrong column.
I have thought about using np.where, but I am wondering, isn't there a less verbose, cleaner solution?
# create a new DataFrame
testDf = pd.DataFrame(columns=['Effective_Date','Paid_Off_Time'])
# keep the earlier of the two dates in Effective_Date
testDf['Effective_Date'] = np.where(myDataFrame.Effective_Date < myDataFrame.Paid_Off_Time,myDataFrame.Effective_Date,myDataFrame.Paid_Off_Time)
# keep the later of the two dates in Paid_Off_Time
testDf['Paid_Off_Time'] = np.where(myDataFrame.Paid_Off_Time < myDataFrame.Effective_Date,myDataFrame.Effective_Date,myDataFrame.Paid_Off_Time)
myDataFrame['Effective_Date'] = testDf['Effective_Date']
myDataFrame['Paid_Off_Time'] = testDf['Paid_Off_Time']
Convert dates to datetime
df=df.assign(Effective_Date=pd.to_datetime(df['Effective_Date'], format='%d %B %Y'),Paid_Off_Time=pd.to_datetime(df['Paid_Off_Time'], format='%d %B %Y'))
Select as per condition
m=df.Effective_Date>df.Paid_Off_Time
Swap values if condition met
df.loc[m, ['Effective_Date','Paid_Off_Time']]=df.loc[m, ['Paid_Off_Time','Effective_Date']].values#Swap rows if condition met
print(df)
ID Effective_Date Paid_Off_Time
0 xqd27070601 2016-07-10 2016-08-09
1 xqd21601070 2016-07-10 2016-09-09
2 xqd26010760 2016-07-10 2016-11-09
I am sharing a piece of my project code in which I did something similar; I hope this kind of implementation gives you a solution.
df['Effective_Date'] = pd.to_datetime(df['Effective_Date'], format='%d/%m/%Y')
df['Paid_Off_Time'] = pd.to_datetime(df['Paid_Off_Time'], format='%d/%m/%Y')
for i in range(len(df)):
    if df.loc[i, 'Effective_Date'] > df.loc[i, 'Paid_Off_Time']:
        # swap the two dates in row i (.loc avoids chained assignment)
        k = df.loc[i, 'Effective_Date']
        df.loc[i, 'Effective_Date'] = df.loc[i, 'Paid_Off_Time']
        df.loc[i, 'Paid_Off_Time'] = k
You can try sorting values in numpy to improve performance:
myDataFrame['Effective_Date'] = pd.to_datetime(myDataFrame['Effective_Date'])
myDataFrame['Paid_Off_Time'] = pd.to_datetime(myDataFrame['Paid_Off_Time'])
c = ['Effective_Date','Paid_Off_Time']
data = np.sort(myDataFrame[c].to_numpy(), axis=1)
myDataFrame[c] = pd.DataFrame(data, columns=c)
print (myDataFrame)
ID Effective_Date Paid_Off_Time
0 xqd27070601 2016-07-10 2016-08-09
1 xqd21601070 2016-07-10 2016-09-09
2 xqd26010760 2016-07-10 2016-11-09
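The EDIT in the question mentions strings in several formats, which a single format argument will not cover. One possible approach, sketched here under the assumption that slash-separated dates are month-first (as 9/18/2016 suggests), is to let dateutil parse each value before any comparison. The Series raw is illustrative data built from the formats the question lists.

```python
import pandas as pd
from dateutil import parser

# Illustrative values in the three formats mentioned in the question's EDIT.
raw = pd.Series(["9/18/2016 16:56", "09 August 2016", "9/18/2016"])

# Parse each string on its own, then make sure the result is a datetime Series.
parsed = pd.to_datetime(raw.apply(parser.parse))

print(parsed)
```

Element-wise parsing is slower than a vectorised call with one format, so it is best kept for columns that really do mix formats.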
Is it possible to convert a date/time in Excel, such as Mon Nov 11 2019 22:12:22 UTC, to an EST date/time value? Essentially, subtract 5 hours from it? Some of the formulas I've been playing with are:
=C2-5/24
=(SUBSTITUTE(LEFT(A1,27),"T"," "))+(MID(A1,28,3)/24)
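The five-hour subtraction can be cross-checked outside Excel. A minimal Python sketch, assuming the text always has the form shown in the question and that a fixed five-hour offset is wanted (daylight saving is ignored); the function name utc_text_to_est is hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset; this deliberately ignores daylight saving time.
EST = timezone(timedelta(hours=-5), "EST")


def utc_text_to_est(text):
    """Parse text like 'Mon Nov 11 2019 22:12:22 UTC' and shift it to UTC-5."""
    parsed = datetime.strptime(text, "%a %b %d %Y %H:%M:%S UTC")
    return parsed.replace(tzinfo=timezone.utc).astimezone(EST)


converted = utc_text_to_est("Mon Nov 11 2019 22:12:22 UTC")
print(converted.strftime("%Y-%m-%d %H:%M:%S"))
```

For the sample value this yields 17:12:22 on the same day, which is what subtracting 5/24 from the Excel serial number should also produce.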
How can I get the next one-year period based on the current month and year? For example:
Jan 2014 - Dec 2014
Feb 2014 - Jan 2015
Mar 2014 - Feb 2015
Apr 2014 - Mar 2015
May 2014 - Apr 2015
Jun 2014 - May 2015
Jul 2014 - Jun 2015
Aug 2014 - Jul 2015
Sep 2014 - Aug 2015
Oct 2014 - Sep 2015
Nov 2014 - Oct 2015
Dec 2014 - Nov 2015
Next period
Jan 2015 - Dec 2015
Feb 2015 - Jan 2016
etc.
I have tried with the following formula:
=UPPER(TEXT(NOW();"MMM")) &" "& TEXT(NOW();"YY")-1
It works fine for Jan 2014, but I can't figure out how to get Dec 2014, or Feb 2014 - Jan 2015, and so on.
I think you need the EOMonth formula.
=EOMONTH(NOW(),-13) +1 and =EOMONTH(NOW(),-2) +1 should give you JAN 2014 to DEC 2014
from the MS Excel documentation
Microsoft Excel stores dates as sequential serial numbers so they can
be used in calculations. By default, January 1, 1900 is serial number
1, and January 1, 2008 is serial number 39448 because it is 39,448
days after January 1, 1900.
To get the text formatting you are after, I would suggest that you stick with formatting the cell/column, as @Makyen has suggested. Having said that, this is the formula that you can use to format the text.
=UPPER(TEXT(EOMONTH(NOW(),-13) +1, "MMM YY"))
Assuming that the date (as a date serial number) for which you desire to find the year period is in cell A1, the following should provide the next year period starting from that day:
=EOMONTH(A1,11) +DAY(A1) -1
Examples:
Input Output
1/18/2014 1/17/2015
2/18/2014 2/17/2015
3/18/2014 3/17/2015
4/18/2014 4/17/2015
5/18/2014 5/17/2015
6/18/2014 6/17/2015
7/18/2014 7/17/2015
8/18/2014 8/17/2015
9/18/2014 9/17/2015
10/18/2014 10/17/2015
11/18/2014 11/17/2015
12/18/2014 12/17/2015
1/18/2015 1/17/2016
2/18/2015 2/17/2016
3/18/2015 3/17/2016
4/18/2015 4/17/2016
5/18/2015 5/17/2016
6/18/2015 6/17/2016
7/18/2015 7/17/2016
8/18/2015 8/17/2016
9/18/2015 9/17/2016
10/18/2015 10/17/2016
11/18/2015 11/17/2016
12/18/2015 12/17/2016
1/18/2016 1/17/2017
If you want the year period to start from the current day:
=EOMONTH(NOW(),11) + DAY(NOW()) -1
If you want the year period to start from the first day of the current month:
=EOMONTH(EOMONTH(NOW(),-1) + 1,11)
or
=EOMONTH(NOW() - DAY(NOW()) + 1,11)
The EOMONTH() function:
EOMONTH(start_date,months)
Returns the serial number for the last day of the month that is the
indicated number of months before or after start_date. Use EOMONTH to
calculate maturity dates or due dates that fall on the last day of the
month.
If this function is not available, and returns the #NAME? error,
install and load the Analysis ToolPak add-in.
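To cross-check the EOMONTH formulas above outside Excel, the same arithmetic can be sketched in Python. This is an illustration, not Excel's implementation: the helpers eomonth and period_end are hypothetical names that mimic EOMONTH(start_date, months) and =EOMONTH(A1,11) + DAY(A1) - 1 using pandas date offsets.

```python
import pandas as pd


def eomonth(start, months):
    """Last day of the month that is `months` months away from `start`."""
    first_of_month = pd.Timestamp(start).normalize().replace(day=1)
    # Move to the first day of the target month, then roll forward to its end.
    return first_of_month + pd.DateOffset(months=months) + pd.offsets.MonthEnd(0)


def period_end(start):
    """End of the one-year period starting on `start` (EOMONTH(A1,11) + DAY(A1) - 1)."""
    start = pd.Timestamp(start)
    return eomonth(start, 11) + pd.Timedelta(days=start.day - 1)


print(period_end("2014-01-18").date())  # period starting 1/18/2014
print(eomonth("2014-01-01", 11).date())  # period starting on the first of the month
```

The first call reproduces the first row of the examples table (1/18/2014 gives 1/17/2015), and the second gives 31 Dec 2014 for a period that starts on 1 Jan 2014.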