Standardizing data in SAS - statistics

*Objective: interpolate and extrapolate date, end time and likes*
I have data like this:
Date end time likes
08/11/2013 3.36 pm 36439569
09/11/2013 4.00pm 36439669
10/11/2013 3.10pm 36439700
11/11/2013 4.15pm 36439713
12/11/2013 12.00pm 36439719
14/11/2013 2.00pm 36439730
15/11/2013 4.10pm 36439800
16/11/2013 9.00pm 36439881
We collect the data online. Each day the data is collected at irregular times. For example, for 08/11/2013 I collected the data (cumulative total number of likes) at 9 am, 10 am, 5 pm, ... Note that the times are irregular.
Now I need to standardize these time intervals at the day level, and the likes as well. My final output should look like this:
Final Output
08/11/2013 3.00AM 09/11/2013 3.00AM 100
09/11/2013 3.00AM 10/11/2013 3.00AM 20
10/11/2013 3.00AM 11/11/2013 3.00AM 644
11/11/2013 3.00AM 12/11/2013 3.00AM 21
12/11/2013 3.00AM 13/11/2013 3.00AM 58
13/11/2013 3.00AM 14/11/2013 3.00AM 2
14/11/2013 3.00AM 15/11/2013 3.00AM 125
15/11/2013 3.00AM 16/11/2013 3.00AM 35
Please help me to get this output
thanks

If you have SAS/ETS licensed, check out PROC EXPAND, which can be used for interpolation and extrapolation.
As @Yick hinted, you're more likely to get a full answer on here if you try to solve the problem yourself first.
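PROC EXPAND is specific to SAS. For readers without SAS/ETS, the same idea (linearly interpolate the cumulative count onto fixed daily boundaries, then difference it) can be sketched in Python with pandas and NumPy. The helper name `increments_on_grid` and the column names are illustrative assumptions, not part of the question. `np.interp` only interpolates and clamps outside the observed range, so this sketch does not extrapolate, and it is not claimed to reproduce the numbers in the asker's expected output.

```python
import numpy as np
import pandas as pd

def increments_on_grid(when, cumulative, grid):
    """Interpolate a cumulative series onto `grid` and difference it (hypothetical helper)."""
    when = pd.DatetimeIndex(when)   # assumed sorted in increasing order
    grid = pd.DatetimeIndex(grid)
    origin = when[0]
    one_second = pd.Timedelta(seconds=1)
    # Express both time axes as float seconds from the first observation.
    x = np.asarray((when - origin) / one_second, dtype=float)
    g = np.asarray((grid - origin) / one_second, dtype=float)
    # Linear interpolation; np.interp clamps outside [x[0], x[-1]].
    values = np.interp(g, x, np.asarray(cumulative, dtype=float))
    return pd.DataFrame({
        "start": grid[:-1],
        "end": grid[1:],
        "likes": np.diff(values),
    })

# First four observations from the question, written as ISO timestamps.
observed = pd.to_datetime([
    "2013-11-08 15:36", "2013-11-09 16:00",
    "2013-11-10 15:10", "2013-11-11 16:15",
])
likes = [36439569, 36439669, 36439700, 36439713]
boundaries = pd.date_range("2013-11-09 03:00", periods=3, freq="D")
print(increments_on_grid(observed, likes, boundaries))
```

Each output row covers one 3:00 AM to 3:00 AM window and holds the estimated number of likes gained in it.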

Related

Formula to calculate total duty hours

How do I find the total working hours of the below driver?
Duty code Dep.Time Arri.Time
A001 03:35 04:20
A001 04:35 05:20
A001 05:51 06:20
A001 06:40 07:20
A001 09:40 10:20
Total Working Hour: 10:20-03:35 = 06:45hrs
Is there a formula to find the total working hours of a single person or a single duty card?
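If you only have one Duty Code, as in the example, you can use the MAX and MIN functions to calculate the total hours: the latest arrival time minus the earliest departure time.
If you have more than one Duty Code, you can use MAXIFS and MINIFS to restrict the calculation to the rows for one code.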
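The same calculation (latest arrival minus earliest departure, per duty code) can be sketched outside Excel, for example in Python with pandas. The column names `duty_code`, `dep` and `arr` are assumptions made for this illustration.

```python
import pandas as pd

duty = pd.DataFrame({
    "duty_code": ["A001"] * 5,
    "dep": ["03:35", "04:35", "05:51", "06:40", "09:40"],
    "arr": ["04:20", "05:20", "06:20", "07:20", "10:20"],
})

# Parse HH:MM strings as durations since midnight.
dep = pd.to_timedelta(duty["dep"] + ":00")
arr = pd.to_timedelta(duty["arr"] + ":00")

# Per duty code: latest arrival minus earliest departure
# (the grouped equivalent of MAXIFS minus MINIFS).
span = arr.groupby(duty["duty_code"]).max() - dep.groupby(duty["duty_code"]).min()
print(span)
```

For the sample duty card this gives 6 hours 45 minutes, matching the 10:20 - 03:35 calculation in the question.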

Pandas strftime with 24 hour format

QUESTION
How can I convert 24 hour time to 12 hour time, when the time provided is two characters long? For example: How to format 45 as 12:45 AM.
ATTEMPT
I can get most of the time conversions to format properly with the following:
df=df.assign(newtime=pd.to_datetime(df['Time Occurred'], format='%H%M').dt.strftime("%I:%M %p"))
df.head()
Date Reported Date Occurred Time Occurred newtime
9/13/2010 9/12/2010 45 4:05 AM
8/9/2010 8/9/2010 1515 3:15 PM
1/8/2010 1/7/2010 2005 8:05 PM
1/9/2010 1/6/2010 2100 9:00 PM
1/15/2010 1/15/2010 245 2:45 AM
In the above, the values in newtime are properly formatted, except where the input time is "45": that time gave the result 4:05 AM. Does anyone know how to create the proper output?
to_datetime
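# Split each HHMM integer into hours and minutes, then build HH:MM:SS strings.
hours, minutes = divmod(df['Time Occurred'].astype(int), 100)
times = pd.to_datetime([f'{h:02d}:{m:02d}:00' for h, m in zip(hours, minutes)])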
df.assign(newtime=times.strftime('%I:%M %p'))
Time Occurred newtime
0 45 12:45 AM
1 1515 03:15 PM
2 2005 08:05 PM
3 2100 09:00 PM
4 245 02:45 AM
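Another way to get the same result, assuming every value is an integer in HHMM form with its leading zeros dropped, is to zero-pad the value to four characters before parsing. A minimal sketch using the question's sample values:

```python
import pandas as pd

df = pd.DataFrame({"Time Occurred": [45, 1515, 2005, 2100, 245]})

# "45" becomes "0045", so %H%M reads it as 00:45 instead of 4:05.
padded = df["Time Occurred"].astype(str).str.zfill(4)
df["newtime"] = pd.to_datetime(padded, format="%H%M").dt.strftime("%I:%M %p")
print(df)
```

The padding removes the ambiguity that made `%H%M` split "45" into hour 4 and minute 5.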

Convert incomplete 12h datetime-like strings into appropriate datetime type

I've got a pandas Series containing datetime-like strings in 12h format, but without the am/pm abbreviations. It covers an entire month of data:
40 01/01/2017 11:51:00
41 01/01/2017 11:51:05
42 01/01/2017 11:55:05
43 01/01/2017 11:55:10
44 01/01/2017 11:59:30
45 01/01/2017 11:59:35
46 02/01/2017 12:00:05
47 02/01/2017 12:00:10
48 02/01/2017 12:13:20
49 02/01/2017 12:13:25
50 02/01/2017 12:24:50
51 02/01/2017 12:24:55
52 02/01/2017 12:33:30
Name: TS, dtype: object
(318621,) # shape
My goal is to convert it to datetime format, so as to obtain the appropriate unix timestamp values and do comparisons/arithmetic with other datetime data that is in 24h format. So I already tried this:
pd.to_datetime(df.TS, format = '%d/%m/%Y %I:%M:%S') # %I for 12h format
Which outputs:
64 2017-01-02 00:46:50
65 2017-01-02 00:46:55
66 2017-01-02 01:01:00
67 2017-01-02 01:01:05
68 2017-01-02 01:05:00
But the am/pm information is not taken into account. I know that, as a rule, the am/pm first has to be specified in the strings; then one can use datetime.datetime.strptime() or pd.to_datetime() to parse them with the %p indicator.
So I wanted to know if there's another way to deal with this issue through the datetime or pandas datetime modules. Or do I have to manually add the 'am/pm' abbreviations before parsing?
You have data in 5-second intervals over multiple days. The desired end format is like this (the AM/PM part is what we need to add; pandas cannot guess it, because it parses one value at a time):
31/12/2016 11:59:55 PM
01/01/2017 12:00:00 AM
01/01/2017 12:00:05 AM
01/01/2017 11:59:55 AM
01/01/2017 12:00:00 PM
01/01/2017 12:59:55 PM
01/01/2017 01:00:00 PM
01/01/2017 01:00:05 PM
01/01/2017 11:59:55 PM
02/01/2017 12:00:00 AM
First, we can parse the whole thing without AM/PM info, as you already showed:
ts = pd.to_datetime(df.TS, format = '%d/%m/%Y %I:%M:%S')
We have a small problem: 12:00:00 may be parsed as hour 12 (noon) rather than hour 0 (midnight). Let's normalize that (if the parser already returns hour 0, this line changes nothing):
ts[ts.dt.hour == 12] -= pd.Timedelta(12, 'h')
Now we have times from 00:00:00 to 11:59:55, twice per day.
Next, note that the transitions are always at 00:00:00. We can easily detect these, as well as the first instance of each date:
import datetime
twelve = ts.dt.time == datetime.time(0, 0, 0)
# Compare calendar days as datetimes so that diff() yields Timedeltas.
newdate = ts.dt.normalize().diff() > pd.Timedelta(0)
midnight = twelve & newdate
noon = twelve & ~newdate
Next, build an offset series, which should be easy to inspect for correctness:
offset = pd.Series(pd.NaT, index=ts.index, dtype='timedelta64[ns]')
offset[midnight] = pd.Timedelta(0)
offset[noon] = pd.Timedelta(12, 'h')
# Carry each offset forward until the next transition.
offset = offset.ffill()
And finally:
ts += offset
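The approach above relies on a row at exactly 00:00:00 for every transition. As a hedged alternative (not the answerer's method), the AM/PM half can be inferred from the point where the 12-hour clock value drops within a date. This sketch assumes the rows are in chronological order and that every date starts in its AM half. The function name `infer_24h` is hypothetical.

```python
import pandas as pd

def infer_24h(strings: pd.Series) -> pd.Series:
    """Infer 24h timestamps from ordered 'dd/mm/YYYY HH:MM:SS' strings without AM/PM."""
    parts = strings.str.split(" ", n=1, expand=True)
    dates = pd.to_datetime(parts[0], format="%d/%m/%Y")
    hms = parts[1].str.split(":", expand=True).astype(int)
    # Map hour 12 to 0 so each half-day runs from 00:00:00 to 11:59:59.
    secs = (hms[0] % 12) * 3600 + hms[1] * 60 + hms[2]
    # Within one date, a drop in the clock value marks the AM -> PM wrap.
    wrapped = (secs.groupby(dates).diff() < 0).astype(int)
    half = wrapped.groupby(dates).cumsum()
    return dates + pd.to_timedelta(secs + half * 12 * 3600, unit="s")

sample = pd.Series([
    "01/01/2017 11:59:55", "01/01/2017 12:00:05", "01/01/2017 01:00:00",
    "02/01/2017 12:00:05", "02/01/2017 11:59:30", "02/01/2017 12:13:20",
])
print(infer_24h(sample))
```

A date whose data starts in the afternoon would be misread as morning under these assumptions, so the result should be spot-checked against known timestamps.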

DAX or Excel query

How can I write a query in DAX for data like this, where there is a start date, an end date and the total working hours, and I need to assign the working hours to each day?
What I want is that, when the start date and the end date are not the same date, the hours are divided between those days.
For example -
User Start-Date End-DATE Hour
Dan 2015-02-05 2015-02-08 32
Here the Start-Date is 2015-02-05, the End-DATE is 2015-02-08, and Hour is 32.
Counting both endpoints, the span from Start-Date to End-DATE covers 4 days.
So I want to divide the hours by that number of days and assign the result to each day.
So the Expected Output will be ---
User Date Hour
Dan 2015-02-05 8
Dan 2015-02-06 8
Dan 2015-02-07 8
Dan 2015-02-08 8
What I have
User Start-Date End-DATE Hour
Dan 2015-02-05 2015-02-08 32
Dan 2015-02-09 2015-02-09 6
Dan 2015-02-10 2015-02-11 3
Dan 2015-02-11 2015-02-12 8
Expected result -
User Date Hour
Dan 2015-02-05 8
Dan 2015-02-06 8
Dan 2015-02-07 8
Dan 2015-02-08 8
Dan 2015-02-09 6
Dan 2015-02-10 3
Dan 2015-02-11 8
Does anyone have an idea how to do that in DAX or an Excel query?
Before I start: I used Power BI for this solution. It is free for personal use, although paid licensing options are also available. The premise is the same, but the solution uses Power Query along with some DAX. Here are my steps.
Here is my table before I made any changes:
User Start End Hour
UserA 01/01/2018 05/01/2018 32
Create a custom column that lists the dates between [Start] and [End] (as day numbers):
Dates = { Number.From([Start])..Number.From([End]) }
Create another custom column that divides [Hour] by the number of list items. Do not expand the list before this step!
CountPerDay = [Hour] / List.Count([Dates])
Finally, expand your list column to get one row per day with its share of the hours. Note that the dates are in a numeric format; changing the column data type to "Date" or "DateTime" will change them back to the correct values.
Using my example you should now have a table that looks something like the below:
User Start End Hour Dates CountPerDay
UserA 01/01/2018 05/01/2018 32 01/01/2018 6.4
UserA 01/01/2018 05/01/2018 32 02/01/2018 6.4
UserA 01/01/2018 05/01/2018 32 03/01/2018 6.4
UserA 01/01/2018 05/01/2018 32 04/01/2018 6.4
UserA 01/01/2018 05/01/2018 32 05/01/2018 6.4
If I add the UserB into the mix with the following records:
User Start End Hour
UserB 01/02/2018 02/02/2018 10
The table updates as follows:
User Start End Hour Dates CountPerDay
UserA 01/01/2018 05/01/2018 32 01/01/2018 6.4
UserA 01/01/2018 05/01/2018 32 02/01/2018 6.4
UserA 01/01/2018 05/01/2018 32 03/01/2018 6.4
UserA 01/01/2018 05/01/2018 32 04/01/2018 6.4
UserA 01/01/2018 05/01/2018 32 05/01/2018 6.4
UserB 01/02/2018 02/02/2018 10 01/02/2018 5
UserB 01/02/2018 02/02/2018 10 02/02/2018 5
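For comparison, the same three steps (build the list of days, divide the hours by the list length, expand the list) can be sketched in Python with pandas. This is only an illustration of the logic, not Power Query code. The frame name `spans` is an assumption, and the dates are read as day/month/year to match the 6.4 hours per day shown above.

```python
import pandas as pd

spans = pd.DataFrame({
    "User": ["UserA", "UserB"],
    "Start": pd.to_datetime(["2018-01-01", "2018-02-01"]),
    "End": pd.to_datetime(["2018-01-05", "2018-02-02"]),
    "Hour": [32, 10],
})

# Step 1: one list of calendar days per row, inclusive of both endpoints.
spans["Dates"] = [
    list(pd.date_range(start, end, freq="D"))
    for start, end in zip(spans["Start"], spans["End"])
]
# Step 2: divide the hours by the list length BEFORE expanding the list.
spans["CountPerDay"] = spans["Hour"] / spans["Dates"].map(len)
# Step 3: expand the list so that each day gets its own row.
per_day = spans.explode("Dates", ignore_index=True)
print(per_day)
```

As in the Power Query version, the division has to happen before the list is expanded, because afterwards the per-row day count is no longer available.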
I hope this helps
J

Daily and Hourly Averages from (m/d/yyyy h:mm) timestamps in Excel

I have an Excel 2007 spreadsheet with date entries in the format m/d/yyyy h:mm (one cell). I would like to find the hourly and daily averages of all the columns of this spreadsheet and save each time aggregation to a new worksheet.
The data is recorded every ~10 minutes, but over the data collection period there were some time slips. Not every hour has the same number of rows. Also, the ending minute is either 0 or 6, depending on the time correction.
What would be a good way to approach this task within Excel 2007? It seems like this might be possible with a pivot table if I can create a formula that will select the correct range for the timestamps. Thanks.
For example, a date-time entry in TIMESTAMP is 10/31/2012 0:06, which is in one cell.
TIMESTAMP Month Day Year Hour Min Rain_mm Rain_mm_2 AirTC AirTC_2 FuelM FuelM_2 VW ... there are ~16 variables (total) after the date time
10/31/2012 0:06 10 31 2012 0 06 0 0 26.11 26.08 2.545 6.4 0.049
10/31/2012 0:16 10 31 2012 0 16 0 0 25.98 25.97 2.624 6.6 0.049
10/31/2012 0:26 10 31 2012 0 26 0 0 24.32 23.33 2.543 6.5 0.048
10/31/2012 0:36 10 31 2012 0 36 0 0 24.32 23.33 2.543 6.5 0.048
10/31/2012 0:46 10 31 2012 0 46 0 0 24.32 23.33 2.543 6.5 0.048
10/31/2012 0:56 10 31 2012 0 56 0 0 25.87 25.87 2.753 7.3 0.049
10/31/2012 1:06 10 31 2012 0 06 0 0 25.74 25.74 2.879 8.1 0.051
## The above is just over one hour of collection on one day ##
...
## Different Day ### Notice Missing Time Stamp
11/30/2012 0:00 11 30 2012 0 06 0 0.1 26.12 26.18 2.535 6.4 0.049
11/30/2012 0:10 11 30 2012 0 16 0 0.1 25.90 25.77 2.424 6.6 0.049
11/30/2012 0:20 11 30 2012 0 26 0.1 0.2 24.12 24.43 2.542 6.4 0.046
11/30/2012 0:30 11 30 2012 0 36 0.1 0 24.22 22.32 2.543 6.5 0.048
11/30/2012 0:50 11 30 2012 0 56 0.1 0.2 26.77 25.87 2.743 6.3 0.049
11/30/2012 1:00 11 30 2012 0 06 0 0 24.34 24.77 2.459 5.1 0.050
## so forth on so on ##
After the requirement was clarified, this answer was edited to cover both daily and hourly averages:
Add a column (here B) for 'H' (i.e. hour) with =HOUR(A2) copied down.
(Note: although formatted to show only m/d/y, the content of column A is assumed, in line with the title, to be the full mm/dd/yyyy hh:mm. This makes the existing Month, Day, Year and Hour columns [whose names are jumbled] redundant.)
Select data range.
Data, Subtotal, At each change in: TIMESTAMP, Use function: Average, Add subtotal to: check only columns G and to the right, OK.
Uncheck Replace current subtotals in Subtotal and apply At each change in: H, Use function: Average, and Add subtotal to: as before, OK.
Replace =SUBTOTAL(1, in Min column with =MIN( .
Delete ‘spare’ Grand Average row.
Reformat as required.
Hopefully this achieves what is required.
Note midnight 'tonight' is counted as within first hour of tomorrow.
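The grouping logic behind those steps (an hour helper column, then averages per day and per hour) can also be sketched outside Excel. The following pandas example uses four rows and two columns taken from the question's sample. It is an illustration only, and the frame name `log` is an assumption.

```python
import pandas as pd

log = pd.DataFrame(
    {"AirTC": [26.11, 25.98, 24.32, 25.74], "FuelM": [2.545, 2.624, 2.543, 2.879]},
    index=pd.to_datetime(
        ["10/31/2012 0:06", "10/31/2012 0:16", "10/31/2012 0:26", "10/31/2012 1:06"],
        format="%m/%d/%Y %H:%M",
    ),
)

# Hourly averages: group by calendar day and by hour of day,
# like the =HOUR(A2) helper column combined with the date.
hourly = (
    log.groupby([log.index.normalize(), log.index.hour])
    .mean()
    .rename_axis(["date", "hour"])
)
# Daily averages: group by calendar day only.
daily = log.groupby(log.index.normalize()).mean()

print(hourly)
print(daily)
```

Because the groups are formed from the timestamps themselves, hours with missing or extra rows are averaged over whatever rows they contain.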
I had a similar need and worked it out this way:
Add a column for Date (assuming your dd/mm/yyyy hh:mm:ss data is in cell A2)
=DATE(YEAR(A2),MONTH(A2),DAY(A2))
Add a column for Year. If you have weeks from a single year, the year column can be neglected.
=YEAR(A2)
Add a column for Week Number
=WEEKNUM(A2)
Add 2 pivot tables, 1 for daily and 1 for weekly analysis.
In the daily pivot table, choose the field "Date" and the quantities you want. Put "Date" in the Rows section and the sum/average of the values in the Values section. You will get a date-wise sum/average of the values you need.
In the weekly pivot table, do the same as above, but put "Year" and "Week Number" in the Rows section instead of "Date".
Hope this helps
