Incident number | Received date | Closed Date | Time taken to close
111 | 01 Jan 2021 | 01 Feb 2021 | 31
222 | 01 Jan 2021 | 07 Feb 2021 | 37
333 | 01 Jan 2021 |  |
444 | 01 Jan 2021 |  |
I wanted to calculate the average number of days incidents have been open at a point in time. So, using the example above, let's say at the end of Feb 2021 you would look at:
Received date has to be less than the metric date (the metric date in this case being the end of Feb 2021)
Closed date has to be either greater than the metric date or empty (if the closed date is empty then the time taken to close is calculated from the received date to the metric date)
Using the example above, the first two incidents would not be included, but the last two would, and the difference between 01 Jan 2021 and 28 Feb 2021 is 58 days for each of them; divide the total (58 + 58) by 2, as that's the number of incidents included in the calculation, to give an average of 58. Using the same example, the calculation for Jan 2021 would be 31 days for each incident, as no incident was closed by 31 Jan, so it's (31*4) / 4. I would be repeating this for Jan – Dec 2020 and 2021.
Encoding an unclosed incident with a missing value will require a CASE (or IF) statement to properly compute the days-open metric on a given asof date.
Example:
The days-open average is computed for a variety of asof dates stored in a data set.
data have;
  call streaminit(2022);
  do id = 1 to 10;
    opened = '01jan2021'd + rand('integer', 60);
    closed = opened + rand('integer', 90);
    if rand('uniform') < 0.25 then call missing(closed);
    output;
  end;
  format opened closed yymmdd10.;
run;
data asof;
  do asof = '01jan2021'd to '01jun2021'd - 1;
    output;
  end;
  format asof yymmdd10.;
run;
proc sql;
  create table averageDaysOpen_asof as
  select
      asof
    , mean(days_open) as days_open_avg format=6.2
    , count(days_open) as id_count
  from
    ( select asof
           , opened
           , closed
           , case
               when closed is not null and asof between opened and closed then asof - opened
               when closed is null and asof > opened then asof - opened
               else .
             end as days_open
      from asof
      cross join have
    )
  group by asof
  ;
quit;
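For anyone who finds it easier to check the logic in pandas, here is a rough, illustrative translation of the same as-of calculation (not part of the original SAS answer; the simulated data only mirror the ranges used above, and the column names are kept the same):

import numpy as np
import pandas as pd

rng = np.random.default_rng(2022)

# simulate incidents roughly like the SAS data step above
have = pd.DataFrame({'id': range(1, 11)})
have['opened'] = pd.Timestamp('2021-01-01') + pd.to_timedelta(rng.integers(1, 61, size=10), unit='D')
have['closed'] = have['opened'] + pd.to_timedelta(rng.integers(1, 91, size=10), unit='D')
have.loc[rng.random(10) < 0.25, 'closed'] = pd.NaT          # some incidents are still open

asof = pd.DataFrame({'asof': pd.date_range('2021-01-01', '2021-05-31')})

# cross join every as-of date with every incident, then apply the same CASE logic
x = asof.merge(have, how='cross')
is_open = ((x['closed'].notna() & x['asof'].between(x['opened'], x['closed'])) |
           (x['closed'].isna() & (x['asof'] > x['opened'])))
x['days_open'] = (x['asof'] - x['opened']).dt.days.where(is_open)

averageDaysOpen_asof = (x.groupby('asof')['days_open']
                          .agg(days_open_avg='mean', id_count='count')
                          .reset_index())

Here count ignores the missing days_open values, just as COUNT(days_open) does in PROC SQL.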
I am currently working on grouping/aggregating data based on date range for a weekly plot.
Below is what my dataframe looks like for the daily data:
daily_dates | registered | attended
02/10/2022 | 0 | 0
02/09/2022 | 0 | 0
02/08/2022 | 1 | 0
02/07/2022 | 1 | 0
02/06/2022 | 20 | 06
02/05/2022 | 05 | 03
02/04/2022 | 15 | 12
02/03/2022 | 10 | 08
02/02/2022 | 10 | 05
The first day of the week I'd want is Sunday.
My current code to perform the weekly grouping is:
weekly_df = weekly_df.resample('w').sum().reset_index()
The output I am desiring is:
weekly_dates | registered | attended
02/06/2022 | 22 | 06
01/30/2022 | 40 | 28
A bit of explanation about the desired output: the reason for 02/06/2022 and 01/30/2022 is that both dates are the start date (a Sunday) of their respective week. And for the week of 01/30/2022, only the dates 02/05/2022, 02/04/2022, 02/03/2022 and 02/02/2022 are considered, as those are the ones present in the daily dataframe.
My current implementation follows the instructions provided here.
I am looking for any suggestions on how to achieve my desired output.
Try:
df.resample('W-SUN', label='left', closed='left').sum().reset_index()
Output:
daily_dates registered attended
0 2022-01-30 40 28
1 2022-02-06 22 6
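If daily_dates is still an ordinary column rather than the index (as in the sample above), it has to be parsed and set as the index before resampling. A minimal, self-contained sketch built from the question's sample data:

import pandas as pd

df = pd.DataFrame({
    'daily_dates': ['02/10/2022', '02/09/2022', '02/08/2022', '02/07/2022', '02/06/2022',
                    '02/05/2022', '02/04/2022', '02/03/2022', '02/02/2022'],
    'registered': [0, 0, 1, 1, 20, 5, 15, 10, 10],
    'attended':   [0, 0, 0, 0, 6, 3, 12, 8, 5],
})
df['daily_dates'] = pd.to_datetime(df['daily_dates'], format='%m/%d/%Y')

weekly_df = (df.set_index('daily_dates')                        # resample needs a DatetimeIndex
               .resample('W-SUN', label='left', closed='left')  # weeks run Sunday-Saturday, labelled by the Sunday
               .sum()
               .reset_index())
print(weekly_df)   # 2022-01-30 -> 40/28 and 2022-02-06 -> 22/6, matching the output above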
I have a dataframe of the form
ID Effective_Date Paid_Off_Time
xqd27070601 09 August 2016 10 July 2016
xqd21601070 09 September 2016 10 July 2016
xqd26010760 10 July 2016 09 November 2016
EDIT
Originally, the dates shown are of type String. Their format can vary, for example: 9/18/2016 16:56, 09 August 2016, or 9/18/2016. Should we consider converting to timestamps for easier comparison?
What I want
If Effective_Date > Paid_Off_Time, replace the value of Effective_Date with Paid_Off_Time and the value of Paid_Off_Time with Effective_Date.
Basically, switch the values between the 2 columns because the date was inserted in the wrong column.
I have thought about using np.where, but I am wondering, isn't there a less verbose, cleaner solution?
#create a new dataFrame
testDf = pd.DataFrame(columns=['Effective_Date','Paid_Off_Time'])
#take the earlier of the two dates as Effective_Date
testDf['Effective_Date'] = np.where(myDataFrame.Effective_Date < myDataFrame.Paid_Off_Time, myDataFrame.Effective_Date, myDataFrame.Paid_Off_Time)
#take the later of the two dates as Paid_Off_Time
testDf['Paid_Off_Time'] = np.where(myDataFrame.Paid_Off_Time < myDataFrame.Effective_Date, myDataFrame.Effective_Date, myDataFrame.Paid_Off_Time)
myDataFrame['Effective_Date'] = testDf['Effective_Date']
myDataFrame['Paid_Off_Time'] = testDf['Paid_Off_Time']
Convert dates to datetime
df=df.assign(Effective_Date=pd.to_datetime(df['Effective_Date'], format='%d %B %Y'),Paid_Off_Time=pd.to_datetime(df['Paid_Off_Time'], format='%d %B %Y'))
Select as per condition
m=df.Effective_Date>df.Paid_Off_Time
Swap values if condition met
df.loc[m, ['Effective_Date','Paid_Off_Time']] = df.loc[m, ['Paid_Off_Time','Effective_Date']].values
print(df)
ID Effective_Date Paid_Off_Time
0 xqd27070601 2016-07-10 2016-08-09
1 xqd21601070 2016-07-10 2016-09-09
2 xqd26010760 2016-07-10 2016-11-09
I am sharing a piece of my project code in which I did something similar; I hope this kind of implementation gives you the solution.
df['Effective_Date'] = pd.to_datetime(df['Effective_Date'], format='%d/%m/%Y')
df['Paid_Off_Time'] = pd.to_datetime(df['Paid_Off_Time'], format='%d/%m/%Y')
for i in range(0, len(df)):
    if df['Effective_Date'][i] > df['Paid_Off_Time'][i]:
        k = df['Effective_Date'][i]
        df['Effective_Date'][i] = df['Paid_Off_Time'][i]
        df['Paid_Off_Time'][i] = k
You can try sorting values in numpy to improve performance:
df['Effective_Date'] = pd.to_datetime(df['Effective_Date'])
df['Paid_Off_Time'] = pd.to_datetime(df['Paid_Off_Time'])
c = ['Effective_Date','Paid_Off_Time']
data = np.sort(df[c].to_numpy(), axis=1)
df[c] = pd.DataFrame(data, columns=c)
print(df)
ID Effective_Date Paid_Off_Time
0 xqd27070601 2016-07-10 2016-08-09
1 xqd21601070 2016-07-10 2016-09-09
2 xqd26010760 2016-07-10 2016-11-09
I have the following script:
DROP TABLE IF EXISTS [dbo].[test]
CREATE TABLE [dbo].[test]
(
[Name] [varchar](50) NULL,
[Amount] [int] NULL
)
GO
INSERT INTO [dbo].[test]
(
[Name]
,[Amount]
)
VALUES
('Abc - 20 april to 7 june 2020',25)
,('Abc - 20 april to 7 june 2020',33)
,('Abc - 20 april to 29 june 2020',15)
,('Abc - 20 april to 29 june 2020',55)
,('Abc - 20 april to 10 may 2020',20)
,('Abc - 20 april to 10 may 2020',75)
,('Abc - 20 april to 10 may 2020',89)
GO
SELECT *
FROM [dbo].[test]
The resulting table gives the following results:
Name | Amount
-------------------------------|-------
Abc - 20 april to 7 june 2020 | 25
Abc - 20 april to 7 june 2020 | 33
Abc - 20 april to 29 june 2020 | 15
Abc - 20 april to 29 june 2020 | 55
Abc - 20 april to 10 may 2020 | 20
Abc - 20 april to 10 may 2020 | 75
Abc - 20 april to 10 may 2020 | 89
I would like to be able to determine the most recent end date in the text field and eliminate the repetitive data by grouping it, keeping the record with the latest end date and aggregating the amount. The result should be a single record with the following data:
Name | Amount
-------------------------------|-------
Abc - 20 april to 29 june 2020 | 312
I've played around with some group bys and text manipulation functions and this is as far as I got using the following code:
select (case when [Name] like '%-%'
then trim(left([Name], charindex('-', reverse([Name])) - 1))
else [Name]
end) as [Name]
,sum(Amount) as Amount
from [dbo].[test]
group by
(case when [Name] like '%-%'
then trim(left([Name], charindex('-', reverse([Name])) - 1))
else [Name]
end)
The above code doesn't really do much more than aggregate unique records. I was looking for something more intelligent that recognizes that all of my records really mean the same thing, works out that the latest end date in the Name field is 29 june 2020, and displays only that single record with the total aggregated Amount.
Any help would be much appreciated.
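Not a T-SQL answer, but to make the intended logic concrete, here is a rough pandas sketch over the same sample rows (purely illustrative; it assumes, as in the sample, that every row belongs to the same logical group, and that the end date is whatever follows " to " in the Name):

import pandas as pd

df = pd.DataFrame({
    'Name': ['Abc - 20 april to 7 june 2020', 'Abc - 20 april to 7 june 2020',
             'Abc - 20 april to 29 june 2020', 'Abc - 20 april to 29 june 2020',
             'Abc - 20 april to 10 may 2020', 'Abc - 20 april to 10 may 2020',
             'Abc - 20 april to 10 may 2020'],
    'Amount': [25, 33, 15, 55, 20, 75, 89],
})

# parse the end date out of the text after ' to '
end_date = pd.to_datetime(df['Name'].str.extract(r'to\s+(.+)$')[0], format='%d %B %Y')

result = pd.DataFrame({
    'Name':   [df.loc[end_date.idxmax(), 'Name']],   # the Name carrying the latest end date
    'Amount': [df['Amount'].sum()],                  # total across all the equivalent rows
})
print(result)   # Abc - 20 april to 29 june 2020    312

The same two steps in T-SQL would be pulling the trailing date out of Name (for example with the REVERSE/CHARINDEX idea from the attempt above), ordering by it, and summing Amount over the group.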
System: WIN10
IDE: MS Visual Studio Code
Language: Python version 3.7.3
Library: pandas version 1.0.1
Data source: supplied in the example below
Dataset: supplied in the example below
Ask:
I need to split the date and time string out of a column in a data frame whose rows have an uneven number of delimiters, i.e. some with three and some with four commas.
I am trying to figure out how to strip the date and time values ('Nov 11 2013 12:00AM' and 'Apr 11 2013 12:00AM' respectively) off the back of these two records into a new column, given that the second row in the example below has fewer commas.
Code:
df['sample field'].head(2)
4457-I need, this, date, Nov 11 2013 12:00AM ,
2359-I need this, date, Apr 11 2013 12:00AM ,
While the below method expands the data into different columns and staggers which column houses the date, this does not work. I need the date and time (or even just date) information in one column so that I can use the date values in further analysis (for example time-series).
Code:
df['sample field'].str.split(",", expand=True)
Data
df=pd.DataFrame({'Text':['4457-I need, this, date, Nov 11 2013 12:00AM ,','2359-I need this, date, Apr 11 2013 12:00AM ,']})
df
Use str.extract with a regex expression:
df['Date'] = df.Text.str.extract(r'([A-Za-z]+\s+\d+\s+\d+\s+\d+:[0-9A-Z]+(?=\s+\,+))')
df
#df.Date=pd.to_datetime(df.Date).dt.strftime('%b %d %Y %H:%M%p')
#df['date'] = pd.to_datetime(df['date'] ,format='%b %d %Y %H:%M%p')
df['Date'] = pd.to_datetime(df['Date'])  # an explicit format such as '%b %d %Y %I:%M%p' also works; because the times are 12AM, use the 12-hour directive %I rather than %H, and note that 12:00AM parses to hour 00
IIUC you need str.extract with a regular expression.
print(df)
0
0 4457-I need, this, date, Nov 11 2013 12:00AM
1 2359-I need this, date, Apr 11 2013 12:00AM
df['date'] = df[0].str.extract(r'(\w{3}\s\d.*\d{4}\s\d{2}:\d{2}\w{2})')
df['date'] = pd.to_datetime(df['date'] ,format='%b %d %Y %H:%M%p')
print(df)
0 date
0 4457-I need, this, date, Nov 11 2013 12:00AM 2013-11-11 12:00:00
1 2359-I need this, date, Apr 11 2013 12:00AM 2013-04-11 12:00:00
I'll use #wwnde's data:
df=pd.DataFrame({'Text':['4457-I need, this, date, Nov 11 2013 12:00AM ,','2359-I need this, date, Apr 11 2013 12:00AM ,']})
df['Date'] = df.Text.str.strip(',').str.split(',').str[-1].str.strip()
df['Date_formatted'] = pd.to_datetime(df.Date, format = '%b %d %Y %H:%M%p')
Text Date Date_formatted
0 4457-I need, this, date, Nov 11 2013 12:00AM , Nov 11 2013 12:00AM 2013-11-11 12:00:00
1 2359-I need this, date, Apr 11 2013 12:00AM , Apr 11 2013 12:00AM 2013-04-11 12:00:00
I have a CSV file that I receive once a week that is in the following format:
"Item","Supplier Item","Description","1","2","3","4","5","6","7","8" ...Linefeed
"","","","Past Due","Day 13-OCT-2014","Buffer 14-OCT-2014","Week 20-OCT-2014","Week 27-OCT-2014", ...LineFeed
"Part1","P1","Big Part","0","0","0","100","50", ...LineFeed
"Part4","P4","Red Part","0","0","0","35","40", ...LineFeed
"Part92","P92","White Part","0","0","0","10","20", ...LineFeed
...
An explanation of the data - row 2 is dynamic data signifying the dates parts are due. Row 3 begins the part numbers, with description and number of parts due on a particular date. So, looking at the above data: row 3, column 7 shows that Part1 has 100 parts due in the week of 20 OCT 2014 and 50 due in the week of 27 OCT 2014.
How can I parse this csv to show the data like this:
Item, Supplier Item, Description, Past Due, Due Date, Amount Due
Part1 P1 Big Part 0 20 OCT 2014 100
Part1 P1 Big Part 0 27 OCT 2014 50
Part4 P4 Red Part 0 20 OCT 2014 35
Part4 P4 Red Part 0 27 OCT 2014 40
....
Is there a way to manipulate the format in Excel to rearrange the data like I need or what is the best method to resolve this?
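If pandas is an option, one rough sketch (the file name and the exact cleanup steps are assumptions) is to read the file twice, recover the due-date labels from row 2, rename the numbered columns, and melt the amount columns into rows:

import pandas as pd

path = 'parts_due.csv'                     # hypothetical file name

labels = pd.read_csv(path, nrows=1)        # row 2 holds 'Past Due' and the due-date labels
data = pd.read_csv(path, skiprows=[1])     # part rows, with row 1 supplying the column names

# replace the numbered columns ("1", "2", ...) with the labels from row 2
data.columns = list(data.columns[:3]) + labels.iloc[0].tolist()[3:]

long = data.melt(id_vars=['Item', 'Supplier Item', 'Description', 'Past Due'],
                 var_name='Due Date', value_name='Amount Due')
long = long[long['Amount Due'].astype(float) > 0]     # keep only the dates with parts actually due
long['Due Date'] = long['Due Date'].str.replace(r'^(Day|Buffer|Week)\s+', '', regex=True)
print(long)

The same unpivot idea can also be done inside Excel with Power Query's Unpivot Columns feature, which avoids writing any code.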