I have a dataframe of the form
ID           Effective_Date     Paid_Off_Time
xqd27070601  09 August 2016     10 July 2016
xqd21601070  09 September 2016  10 July 2016
xqd26010760  10 July 2016       09 November 2016
EDIT
Originally, the dates are strings, and their format varies: it can be 9/18/2016 16:56, 09 August 2016, or 9/18/2016. Should we consider converting to timestamps for easier comparison?
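A minimal sketch of such a conversion, assuming pandas >= 2.0 (older versions generally parse mixed formats element-wise without the format argument):
import pandas as pd
# the three sample formats from the question
s = pd.Series(['9/18/2016 16:56', '09 August 2016', '9/18/2016'])
# format='mixed' tells pandas to infer the format of each element separately
print(pd.to_datetime(s, format='mixed'))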
What I want
If Effective_Date > Paid_Off_Time, replace the value of Effective_Date with Paid_Off_Time and the value of Paid_Off_Time with Effective_Date.
Basically, switch the values between the two columns, because the date was inserted in the wrong column.
I have thought about using np.where, but I am wondering: isn't there a less verbose, cleaner solution?
#create a new DataFrame to hold the corrected columns
testDf = pd.DataFrame(columns=['Effective_Date','Paid_Off_Time'])
#take the earlier of the two dates as Effective_Date
testDf['Effective_Date'] = np.where(myDataFrame.Effective_Date < myDataFrame.Paid_Off_Time,myDataFrame.Effective_Date,myDataFrame.Paid_Off_Time)
#take the later of the two dates as Paid_Off_Time
testDf['Paid_Off_Time'] = np.where(myDataFrame.Paid_Off_Time < myDataFrame.Effective_Date,myDataFrame.Effective_Date,myDataFrame.Paid_Off_Time)
myDataFrame['Effective_Date'] = testDf['Effective_Date']
myDataFrame['Paid_Off_Time'] = testDf['Paid_Off_Time']
Convert the dates to datetime:
df = df.assign(Effective_Date=pd.to_datetime(df['Effective_Date'], format='%d %B %Y'),
               Paid_Off_Time=pd.to_datetime(df['Paid_Off_Time'], format='%d %B %Y'))
Build a boolean mask for the rows that need fixing:
m = df.Effective_Date > df.Paid_Off_Time
Swap the values where the condition is met:
#.values drops the column labels, so the assignment is positional and the columns swap
df.loc[m, ['Effective_Date','Paid_Off_Time']] = df.loc[m, ['Paid_Off_Time','Effective_Date']].values
print(df)
            ID Effective_Date Paid_Off_Time
0  xqd27070601     2016-07-10    2016-08-09
1  xqd21601070     2016-07-10    2016-09-09
2  xqd26010760     2016-07-10    2016-11-09
I am sharing a piece of my project code in which I did something similar; I hope this kind of implementation gives you the solution.
df['Effective_Date'] = pd.to_datetime(df['Effective_Date'], format='%d %B %Y')
df['Paid_Off_Time'] = pd.to_datetime(df['Paid_Off_Time'], format='%d %B %Y')
for i in range(len(df)):
    if df.loc[i, 'Effective_Date'] > df.loc[i, 'Paid_Off_Time']:
        # swap the two values through a temporary variable
        k = df.loc[i, 'Effective_Date']
        df.loc[i, 'Effective_Date'] = df.loc[i, 'Paid_Off_Time']
        df.loc[i, 'Paid_Off_Time'] = k
You can try sorting the values in numpy to improve performance:
myDataFrame['Effective_Date'] = pd.to_datetime(myDataFrame['Effective_Date'])
myDataFrame['Paid_Off_Time'] = pd.to_datetime(myDataFrame['Paid_Off_Time'])
c = ['Effective_Date','Paid_Off_Time']
#sort each row's pair of dates so the earlier one lands in Effective_Date
data = np.sort(myDataFrame[c].to_numpy(), axis=1)
myDataFrame[c] = pd.DataFrame(data, columns=c, index=myDataFrame.index)
print (myDataFrame)
ID Effective_Date Paid_Off_Time
0 xqd27070601 2016-07-10 2016-08-09
1 xqd21601070 2016-07-10 2016-09-09
2 xqd26010760 2016-07-10 2016-11-09
Incident number | Received date | Closed Date | Time taken to close
----------------|---------------|-------------|--------------------
111             | 01 Jan 2021   | 01 Feb 2021 | 31
222             | 01 Jan 2021   | 07 Feb 2021 | 37
333             | 01 Jan 2021   |             |
444             | 01 Jan 2021   |             |
I want to calculate the average number of days incidents have been open at a point in time. Using the example above, at the end of Feb 2021 you would look at incidents where:
The received date is earlier than the metric date (the metric date in this case being the end of Feb 2021).
The closed date is either later than the metric date or empty (if the closed date is empty, the time taken to close runs from the received date to the metric date).
Using the example above, the first two incidents would not be included, but the last two would: the difference between 01 Jan 2021 and 28 Feb 2021 is 58 days; divide that by 2, the number of incidents included, to get an average of 58. By the same logic, the calculation for Jan 2021 would be 31 days for each incident, since no incident was closed by 31 Jan, i.e. (31*4) / 4. I would repeat this for Jan through Dec of 2020 and 2021.
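As a minimal pandas sketch of that rule (the DataFrame below mirrors the table above, with NaT marking unclosed incidents; avg_days_open is a made-up helper name):
import pandas as pd

df = pd.DataFrame({
    'incident': [111, 222, 333, 444],
    'received': pd.to_datetime(['01 Jan 2021'] * 4, format='%d %b %Y'),
    'closed': pd.to_datetime(['01 Feb 2021', '07 Feb 2021', None, None],
                             format='%d %b %Y'),
})

def avg_days_open(df, metric_date):
    metric_date = pd.Timestamp(metric_date)
    # received on or before the metric date, and closed after it or still open
    mask = (df['received'] <= metric_date) & \
           (df['closed'].isna() | (df['closed'] > metric_date))
    return (metric_date - df.loc[mask, 'received']).dt.days.mean()

print(avg_days_open(df, '2021-02-28'))  # 58.0 (incidents 333 and 444)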
Encoding an unclosed incident as a missing value requires a CASE expression (or IF statement) to properly compute the days-open metric on a given as-of date.
Example:
The days open average is computed for a variety of asof dates stored in a data set.
data have;
call streaminit(2022);
/* simulate 10 incidents; about a quarter stay unclosed (missing closed date) */
do id = 1 to 10;
opened = '01jan2021'd + rand('integer', 60);
closed = opened + rand('integer', 90);
if rand('uniform') < 0.25 then call missing(closed);
output;
end;
format opened closed yymmdd10.;
run;
data asof;
/* one row per as-of date to evaluate */
do asof = '01jan2021'd to '01jun2021'd-1;
output;
end;
format asof yymmdd10.;
run;
proc sql;
create table averageDaysOpen_asof
as
select
asof
, mean (days_open) as days_open_avg format=6.2
, count(days_open) as id_count
from
( select asof
, opened
, closed
, case
when closed is not null and asof between opened and closed then asof-opened
when closed is null and asof > opened then asof-opened
else .
end as days_open
from asof
cross join have
)
group by asof
;
quit;
So I have the Edition column, which contains data in an uneven pattern: some values have ',' followed by the date, and some have a ',–' pattern.
df.head()
17 Paperback,– 1 Nov 2016
18 Mass Market Paperback,– 1 Jan 1991
19 Paperback,– 2016
20 Hardcover,– 24 Nov 2018
21 Paperback,– Import, 4 Oct 2018
How can I extract the date into a separate column? I tried using str.split() but can't find a specific pattern to extract. Is there any method I could use?
obj = df['Edition']
#capture an optional "day month" prefix plus the four-digit year at the end
obj.str.split(r'((?:\d+\s+\w+\s+)?\d{4}$)', expand=True)
or
obj.str.split(r'[,–]+').str[0]   # edition name
obj.str.split(r'[,–]+').str[-1]  # date
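A small end-to-end sketch of the same idea using str.extract, so the date lands directly in a new column (assuming the column is named Edition and pandas >= 2.0 for format='mixed'):
import pandas as pd

df = pd.DataFrame({'Edition': ['Paperback,– 1 Nov 2016',
                               'Mass Market Paperback,– 1 Jan 1991',
                               'Paperback,– 2016',
                               'Hardcover,– 24 Nov 2018',
                               'Paperback,– Import, 4 Oct 2018']})
# optional "day month" prefix, then a four-digit year at the end of the string
df['Date'] = df['Edition'].str.extract(r'((?:\d+\s+\w+\s+)?\d{4})\s*$', expand=False)
# format='mixed' copes with both full dates and bare years like '2016'
df['Date'] = pd.to_datetime(df['Date'], format='mixed')
print(df)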
Try using dateutil:
from dateutil.parser import parse
#fuzzy_with_tokens=True returns a (datetime, skipped_tokens) tuple, hence the [0]
df['Dt'] = [parse(i, fuzzy_with_tokens=True)[0] for i in df['Edition']]
I have the following script:
DROP TABLE IF EXISTS [dbo].[test]
CREATE TABLE [dbo].[test]
(
[Name] [varchar](50) NULL,
[Amount] [int] NULL
)
GO
INSERT INTO [dbo].[test]
(
[Name]
,[Amount]
)
VALUES
('Abc - 20 april to 7 june 2020',25)
,('Abc - 20 april to 7 june 2020',33)
,('Abc - 20 april to 29 june 2020',15)
,('Abc - 20 april to 29 june 2020',55)
,('Abc - 20 april to 10 may 2020',20)
,('Abc - 20 april to 10 may 2020',75)
,('Abc - 20 april to 10 may 2020',89)
GO
SELECT *
FROM [dbo].[test]
The resulting table gives the following results:
Name | Amount
-------------------------------|-------
Abc - 20 april to 7 june 2020 | 25
Abc - 20 april to 7 june 2020 | 33
Abc - 20 april to 29 june 2020 | 15
Abc - 20 april to 29 june 2020 | 55
Abc - 20 april to 10 may 2020 | 20
Abc - 20 april to 10 may 2020 | 75
Abc - 20 april to 10 may 2020 | 89
I would like to determine the most recent end date in the text field and eliminate the repetitive data by grouping: keep the record with the latest end date and aggregate the amount. The result should be a single record with the following data:
Name | Amount
-------------------------------|-------
Abc - 20 april to 29 june 2020 | 312
I've played around with some GROUP BYs and text manipulation functions, and this is as far as I got with the following code:
select (case when [Name] like '%-%'
then trim(left([Name], charindex('-', reverse([Name])) - 1))
else [Name]
end) as [Name]
,sum(Amount) as Amount
from [dbo].[test]
group by
(case when [Name] like '%-%'
then trim(left([Name], charindex('-', reverse([Name])) - 1))
else [Name]
end)
The above code doesn't do much more than aggregate unique records. I was looking for something more intelligent that recognizes that all of my records really mean the same thing, figures out that the latest end date in the Name field is 29 june 2020, and displays only that single record with the total aggregated Amount.
Any help would be much appreciated.
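For what it's worth, here is a sketch of the idea in pandas rather than T-SQL, assuming a single name prefix as in the sample: extract the trailing end date, keep the name carrying the latest one, and sum the amounts.
import pandas as pd

df = pd.DataFrame({'Name': ['Abc - 20 april to 7 june 2020',
                            'Abc - 20 april to 7 june 2020',
                            'Abc - 20 april to 29 june 2020',
                            'Abc - 20 april to 29 june 2020',
                            'Abc - 20 april to 10 may 2020',
                            'Abc - 20 april to 10 may 2020',
                            'Abc - 20 april to 10 may 2020'],
                   'Amount': [25, 33, 15, 55, 20, 75, 89]})
# parse the end date out of the trailing "to <day> <month> <year>"
end = pd.to_datetime(df['Name'].str.extract(r'to\s+(\d+\s+\w+\s+\d{4})\s*$',
                                            expand=False),
                     format='%d %B %Y')
print(df.loc[end.idxmax(), 'Name'], df['Amount'].sum())
# Abc - 20 april to 29 june 2020 312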
System: WIN10
IDE: MS Visual Studio Code
Language: Python version 3.7.3
Library: pandas version 1.0.1
Data source: supplied in the example below
Dataset: supplied in the example below
Ask:
I need to split the date and time string out of a column in a data frame whose rows have uneven delimiters, i.e. some with three and some with four commas.
I am trying to figure out how to strip the date and time values, 'Nov 11 2013 12:00AM' and 'Apr 11 2013 12:00AM' respectively, off the back of these two records in one column and into a new column, given that the second row in the example below has fewer commas.
Code:
df['sample field'].head(2)
4457-I need, this, date, Nov 11 2013 12:00AM ,
2359-I need this, date, Apr 11 2013 12:00AM ,
The method below expands the data into different columns, but it staggers which column houses the date, so it does not work. I need the date and time (or even just the date) in one column so that I can use the date values in further analysis (for example, time series).
Code:
df['sample field'].str.split(",", expand=True)
Data
df=pd.DataFrame({'Text':['4457-I need, this, date, Nov 11 2013 12:00AM ,','2359-I need this, date, Apr 11 2013 12:00AM ,']})
df
Use str.extract with a regex expression:
df['Date'] = df.Text.str.extract(r'([A-Za-z]+\s+\d+\s+\d+\s+\d+:[0-9A-Z]+(?=\s+,+))', expand=False)
df
df['Date'] = pd.to_datetime(df['Date'])
#pd.to_datetime(df['Date'], format='%b %d %Y %I:%M%p') would also work. Just
#remember to use the 12-hour code %I, not %H, because the times carry AM/PM;
#12:00AM parses to hour 00, which the default display shows as 00:00:00.
df
IIUC you need str.extract with a regular expression.
print(df)
0
0 4457-I need, this, date, Nov 11 2013 12:00AM
1 2359-I need this, date, Apr 11 2013 12:00AM
df['date'] = df[0].str.extract(r'(\w{3}\s\d.*\d{4}\s\d{2}:\d{2}\w{2})', expand=False)
df['date'] = pd.to_datetime(df['date'], format='%b %d %Y %I:%M%p')
print(df)
0 date
0 4457-I need, this, date, Nov 11 2013 12:00AM 2013-11-11 00:00:00
1 2359-I need this, date, Apr 11 2013 12:00AM 2013-04-11 00:00:00
I'll use #wwnde's data:
df=pd.DataFrame({'Text':['4457-I need, this, date, Nov 11 2013 12:00AM ,','2359-I need this, date, Apr 11 2013 12:00AM ,']})
#take the last comma-separated field and trim the surrounding whitespace
df['Date'] = df.Text.str.strip(',').str.split(',').str[-1].str.strip()
df['Date_formatted'] = pd.to_datetime(df.Date, format='%b %d %Y %I:%M%p')
Text Date Date_formatted
0 4457-I need, this, date, Nov 11 2013 12:00AM , Nov 11 2013 12:00AM 2013-11-11 00:00:00
1 2359-I need this, date, Apr 11 2013 12:00AM , Apr 11 2013 12:00AM 2013-04-11 00:00:00
I have many strings like these.
Roliffe (Day) - Thursday, 15 June 2019
Tadcorp Pk Munangle (Day) - Tuesday, 10 July 2019
Gecester Park (Night) - Friday, 26 June 2019
I need to take the names, for example Roliffe, Tadcorp Pk Munangle, Gecester Park,
and the dates: 15 June 2019, 10 July 2019, 26 June 2019.
How can I do this?
I would use regular expressions like this:
import re
string = """Roliffe (Day) - Thursday, 15 June 2019
Tadcorp Pk Munangle (Day) - Tuesday, 10 July 2019
Gecester Park (Night) - Friday, 26 June 2019"""
places = re.findall(r'([\w ]*) \(.*\)', string)
dates = re.findall(r'\d{2} \w* \d{4}', string)
print(', '.join(places))
print(', '.join(dates))
Output
Roliffe, Tadcorp Pk Munangle, Gecester Park
15 June 2019, 10 July 2019, 26 June 2019
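If the strings sit in a DataFrame column instead, the same patterns work with str.extract (a sketch, assuming a hypothetical column named raw):
import pandas as pd

df = pd.DataFrame({'raw': ['Roliffe (Day) - Thursday, 15 June 2019',
                           'Tadcorp Pk Munangle (Day) - Tuesday, 10 July 2019',
                           'Gecester Park (Night) - Friday, 26 June 2019']})
# name: everything before the first parenthesis; date: "day month year"
df['place'] = df['raw'].str.extract(r'^([\w ]+?) \(', expand=False)
df['date'] = df['raw'].str.extract(r'(\d{1,2} \w+ \d{4})', expand=False)
print(df[['place', 'date']])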
If the data follows the same pattern, this will not be the most efficient approach, but it will work:
s = 'Roliffe (Day) - Thursday, 15 June 2019'
first_split = s.split('(')
name = first_split[0].strip()                # 'Roliffe'
date = first_split[1].split(',')[1].strip()  # '15 June 2019'