How to convert date stored in YYYYMMDD to datetime format in pandas - python-3.x

I have a column in a dataframe which is in YYYYMMDD format and I want to convert it into datetime format. How can I do this in pandas?
Input
20180504
20180516
20180516
20180517
Expected Output
Date datetime
20180504 04/5/2018 00:00:00
20180516 16/5/2018 00:00:00
20180516 16/5/2018 00:00:00
20180517 17/5/2018 00:00:00

Use to_datetime:
df['datetime'] = pd.to_datetime(df['Input'], format='%Y%m%d')
print (df)
Input datetime
0 20180504 2018-05-04
1 20180516 2018-05-16
2 20180516 2018-05-16
3 20180517 2018-05-17
Zero times are not displayed in the column, but if you convert to a list you can see them:
print (df['datetime'].tolist())
[Timestamp('2018-05-04 00:00:00'),
Timestamp('2018-05-16 00:00:00'),
Timestamp('2018-05-16 00:00:00'),
Timestamp('2018-05-17 00:00:00')]
If the input is a csv file:
df = pd.read_csv(file, parse_dates=['Input'])
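A minimal runnable sketch, assuming a CSV with the Input column from above (io.StringIO stands in for the file):
import io
import pandas as pd

csv_data = io.StringIO("Input\n20180504\n20180516\n20180516\n20180517")
#read the column as string, then parse with an explicit format so the
#YYYYMMDD integers are not misinterpreted
df = pd.read_csv(csv_data, dtype={'Input': str})
df['datetime'] = pd.to_datetime(df['Input'], format='%Y%m%d')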
If you want the same format as in the question it is possible, but the output is strings, not datetimes:
df['datetime'] = pd.to_datetime(df['Input'], format='%Y%m%d').dt.strftime('%d/%m/%Y %H:%M:%S')
print (df)
Input datetime
0 20180504 04/05/2018 00:00:00
1 20180516 16/05/2018 00:00:00
2 20180516 16/05/2018 00:00:00
3 20180517 17/05/2018 00:00:00
print (df['datetime'].tolist())
['04/05/2018 00:00:00', '16/05/2018 00:00:00', '16/05/2018 00:00:00', '17/05/2018 00:00:00']

Related

how to get employee count by Hour and Date using pySpark / python?

I have employee ids with their clock-in and clock-out timings by day. I want to calculate the number of employees present in the office by hour and by date.
Example Data
import pandas as pd
data1 = {'emp_id': ['Employee 1', 'Employee 2', 'Employee 3', 'Employee 4', 'Employee 5'],
         'Clockin': ['12/5/2021 0:08', '8/7/2021 0:04', '3/30/2021 1:24', '12/23/2021 22:45', '12/23/2021 23:29'],
         'Clockout': ['12/5/2021 3:28', '8/7/2021 0:34', '3/30/2021 4:37', '12/24/2021 0:42', '12/24/2021 1:42']}
df1 = pd.DataFrame(data1)
Example of output
import pandas as pd
data2 = {'Date': ['12/5/2021', '8/7/2021', '3/30/2021', '3/30/2021', '3/30/2021', '3/30/2021', '12/23/2021', '12/23/2021', '12/24/2021', '12/24/2021'],
         'Hour': ['01:00', '01:00', '02:00', '03:00', '04:00', '05:00', '22:00', '23:00', '01:00', '02:00'],
         'emp_count': [1, 1, 1, 1, 1, 1, 1, 2, 2, 1]}
df2 = pd.DataFrame(data2)
Try this:
# Round clock in DOWN to the nearest PRECEDING hour
clock_in = pd.to_datetime(df1["Clockin"]).dt.floor("H")
# Round clock out UP to the nearest SUCCEEDING hour
clock_out = pd.to_datetime(df1["Clockout"]).dt.ceil("H")
# Generate time series at hourly frequency between adjusted clock in and clock
# out time
hours = pd.Series(
    [
        pd.date_range(in_, out_, freq="H", inclusive="right")
        for in_, out_ in zip(clock_in, clock_out)
    ]
).explode()
# Final result
hours.groupby(hours).count()
Result:
2021-03-30 02:00:00 1
2021-03-30 03:00:00 1
2021-03-30 04:00:00 1
2021-03-30 05:00:00 1
2021-08-07 01:00:00 1
2021-12-05 01:00:00 1
2021-12-05 02:00:00 1
2021-12-05 03:00:00 1
2021-12-05 04:00:00 1
2021-12-23 23:00:00 1
2021-12-24 00:00:00 2
2021-12-24 01:00:00 2
2021-12-24 02:00:00 1
dtype: int64
It's slightly different from your expected output but consistent with your business rules.
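If you need the exact Date/Hour/emp_count layout from the question, a possible follow-up sketch (column names are taken from your expected output; dates come out zero-padded, unlike the example):
counts = hours.groupby(hours).count()
out = pd.DataFrame({'Date': counts.index.strftime('%m/%d/%Y'),
                    'Hour': counts.index.strftime('%H:%M'),
                    'emp_count': counts.values})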

How to get the minimum time value in a dataframe with excluding specific value

I have a dataframe in the format below. I am looking to get the minimum time value for each column and save it in a list, while excluding the specific time value 00:00:00 from being the minimum in any column of the dataframe.
df =
10.0.0.155 192.168.1.240 192.168.0.242
0 19:48:46 16:23:40 20:14:07
1 20:15:46 16:23:39 20:14:09
2 19:49:37 16:23:20 00:00:00
3 20:15:08 00:00:00 00:00:00
4 19:48:46 00:00:00 00:00:00
5 19:47:30 00:00:00 00:00:00
6 19:49:13 00:00:00 00:00:00
7 20:15:50 00:00:00 00:00:00
8 19:45:34 00:00:00 00:00:00
9 19:45:33 00:00:00 00:00:00
I tried to use the code below, but it doesn't work:
minValues = []
for column in df:
    #print(df[column])
    if "00:00:00" in df[column]:
        minValues.append(df[column].nlargest(2).iloc[-1])
    else:
        minValues.append(df[column].min())
print (df)
print (minValues)
The idea is to replace 00:00:00 with missing values and then get the minimal timedeltas:
df1 = df.astype(str).apply(pd.to_timedelta)
s1 = df1.mask(df1.eq(pd.Timedelta(0))).min()
print (s1)
10.0.0.155 0 days 19:45:33
192.168.1.240 0 days 16:23:20
192.168.0.242 0 days 20:14:07
dtype: timedelta64[ns]
Or get the minimal datetimes and then convert the output to HH:MM:SS values:
df1 = df.astype(str).apply(pd.to_datetime)
s2 = df1.mask(df1.eq(pd.to_datetime("00:00:00"))).min().dt.strftime('%H:%M:%S')
print (s2)
10.0.0.155 19:45:33
192.168.1.240 16:23:20
192.168.0.242 20:14:07
dtype: object
Or to times:
df1 = df.astype(str).apply(pd.to_datetime)
s3 = df1.mask(df1.eq(pd.to_datetime("00:00:00"))).min().dt.time
print (s3)
10.0.0.155 19:45:33
192.168.1.240 16:23:20
192.168.0.242 20:14:07
dtype: object
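To build the minValues list the question asks for, any of these results can be converted, for example:
minValues = s3.tolist()
#a list of datetime.time objects, e.g. [datetime.time(19, 45, 33), ...]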

How to read in an unusual date/time format

I have a small df with a date/time column using a format I have never seen.
Pandas reads it in as an object even if I use parse_dates, and to_datetime() chokes on it.
The dates in the column are formatted as such:
2019/12/29 GMT+8 18:00
2019/12/15 GMT+8 05:00
I think the best approach is using a date parsing pattern. Something like this:
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)
But I simply do not know how to approach this format.
The %z datetime format is very specific about how the UTC offset is written.
strftime() and strptime() Format Codes
The offset must be + or - followed by a zero-padded hh:mm value, e.g. +08:00, -08:00, +10:00 or -10:00.
Use str.zfill to backfill the 0 between the sign and the integer.
import pandas as pd
# sample data
df = pd.DataFrame({'datetime': ['2019/12/29 GMT+8 18:00', '2019/12/15 GMT+8 05:00', '2019/12/15 GMT+10 05:00', '2019/12/15 GMT-10 05:00']})
# display(df)
datetime
2019/12/29 GMT+8 18:00
2019/12/15 GMT+8 05:00
2019/12/15 GMT+10 05:00
2019/12/15 GMT-10 05:00
# fix the format
df.datetime = df.datetime.str.split(' ').apply(lambda x: x[0] + x[2] + x[1][3:].zfill(3) + ':00')
# convert to a utc datetime
df.datetime = pd.to_datetime(df.datetime, format='%Y/%m/%d%H:%M%z', utc=True)
# display(df)
datetime
2019-12-29 10:00:00+00:00
2019-12-14 21:00:00+00:00
2019-12-14 19:00:00+00:00
2019-12-15 15:00:00+00:00
print(df.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 datetime 4 non-null datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1)
memory usage: 160.0 bytes
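If you need the times in a specific zone instead of UTC, a small follow-up sketch (the zone name here is an assumption; pick the one matching your data):
# assumption: interpret GMT+8 as the Asia/Shanghai zone
df.datetime = df.datetime.dt.tz_convert('Asia/Shanghai')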
You could pass the custom format with GMT+8 in the middle and then subtract eight hours with timedelta(hours=8):
import pandas as pd
from datetime import timedelta
df['Date'] = pd.to_datetime(df['Date'], format='%Y/%m/%d GMT+8 %H:%M') - timedelta(hours=8)
df
Date
0 2019-12-29 10:00:00
1 2019-12-14 21:00:00
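If the offset is not always +8, a hedged sketch that extracts it per row (assuming whole-hour offsets like GMT+8 or GMT-10, and that df['Date'] still holds the raw strings):
parts = df['Date'].str.extract(r'(?P<d>\S+) GMT(?P<off>[+-]\d+) (?P<t>\S+)')
naive = pd.to_datetime(parts['d'] + ' ' + parts['t'], format='%Y/%m/%d %H:%M')
#shift each row back by its own offset to get UTC
df['Date'] = naive - pd.to_timedelta(parts['off'].astype(int), unit='h')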

Check whether a certain datetime value is missing in a given period

I have a df with DateTime index as follows:
DateTime
2017-01-02 15:00:00
2017-01-02 16:00:00
2017-01-02 18:00:00
....
....
2019-12-07 22:00:00
2019-12-07 23:00:00
Now, I want to know whether any time is missing in the 1-hour interval. For instance, a reading is missing before the 3rd row, since we went from 16:00 straight to 18:00. Is it possible to detect this?
Create a date_range between the minimal and maximal datetime and filter the values by Index.isin with boolean indexing, using ~ to invert the mask:
print (df)
DateTime
0 2017-01-02 15:00:00
1 2017-01-02 16:00:00
2 2017-01-02 18:00:00
r = pd.date_range(df['DateTime'].min(), df['DateTime'].max(), freq='H')
print (r)
DatetimeIndex(['2017-01-02 15:00:00', '2017-01-02 16:00:00',
'2017-01-02 17:00:00', '2017-01-02 18:00:00'],
dtype='datetime64[ns]', freq='H')
out = r[~r.isin(df['DateTime'])]
print (out)
DatetimeIndex(['2017-01-02 17:00:00'], dtype='datetime64[ns]', freq='H')
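The same filtering can also be written with Index.difference, a one-line sketch:
out = r.difference(df['DateTime'])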
Another idea is to create a DatetimeIndex with a helper column, change the frequency by Series.asfreq, and filter the index values that have missing values:
s = df[['DateTime']].assign(val=1).set_index('DateTime')['val'].asfreq('H')
print (s)
DateTime
2017-01-02 15:00:00 1.0
2017-01-02 16:00:00 1.0
2017-01-02 17:00:00 NaN
2017-01-02 18:00:00 1.0
Freq: H, Name: val, dtype: float64
out = s.index[s.isna()]
print (out)
DatetimeIndex(['2017-01-02 17:00:00'], dtype='datetime64[ns]', name='DateTime', freq='H')
Is it safe to assume that the datetime format will always be the same? If yes, why not extract the "hour" values from your respective timestamps and compare them to the interval you desire, e.g.:
import re
#store some datetime values for show
datetimes = [
    "2017-01-02 15:00:00",
    "2017-01-02 16:00:00",
    "2017-01-02 18:00:00",
    "2019-12-07 22:00:00",
    "2019-12-07 23:00:00"
]
#extract hour value via regex (the first match is always the hours in this format)
findHour = re.compile(r"\d{2}(?=:)")
prevx = findHour.findall(datetimes[0])[0]
#simple comparison: compare to previous value, calculate difference, set previous value to current value
for x in datetimes[1:]:
    cmp = findHour.findall(x)[0]
    diff = int(cmp) - int(prevx)
    if diff > 1:
        print("Missing Timestamp(s) between {} and {} hours!".format(prevx, cmp))
    prevx = cmp
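Note that the hour-only comparison misses gaps that span midnight or separate dates; a more robust sketch compares the full timestamps with pandas:
s = pd.to_datetime(pd.Series(datetimes))
#rows that follow a gap of more than one hour
print(s[s.diff() > pd.Timedelta(hours=1)])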

Read a date time column into a pandas dataframe and retain the seconds information

My csv file.
Timestamp
---------------------
1/4/2019 2:00:09 PM
1/4/2019 2:00:18 PM
I have a column of date time information in a csv file. I want to read it as a timestamp column into a pandas dataframe and retain the seconds information.
Effort 1:
I tried
def dateparse(timestamp):
    return pd.datetime.strptime(timestamp, '%m/%d/%Y %H:%M:%S ')
df = pd.read_csv('file_name.csv', parse_dates=['Timestamp'], date_parser=dateparse)
Above rounds off the seconds to something like
1/4/2019 2:00:00
Effort 2:
I thought of reading the entire file and later converting it into a dataframe.
with open('file name.csv') as f:
    for line in f:
        print(line)
But again here seconds information is rounded off.
edit 1:
The seconds info is truncated when I open this csv file in editors like sublime.
For me it works when omitting date_parser=dateparse:
import pandas as pd
temp=u"""Timestamp1
1/4/2019 2:00:09 PM
1/4/2019 2:00:18 PM"""
#after testing replace 'pd.compat.StringIO(temp)' with 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), parse_dates=['Timestamp1'])
print (df)
Timestamp1
0 2019-01-04 14:00:09
1 2019-01-04 14:00:18
print (df.dtypes)
Timestamp1 datetime64[ns]
dtype: object
EDIT1:
The format of the datetimes should be corrected to use 12-hour time with %I and %p:
import pandas as pd
def dateparse(timestamp):
    return pd.datetime.strptime(timestamp, '%m/%d/%Y %I:%M:%S %p')
temp=u"""Timestamp1
1/4/2019 2:00:09 AM
1/4/2019 2:00:09 PM
1/4/2019 2:00:18 PM"""
#after testing replace 'pd.compat.StringIO(temp)' with 'filename.csv'
df = pd.read_csv(pd.compat.StringIO(temp), parse_dates=['Timestamp1'],date_parser=dateparse)
print (df)
Timestamp1
0 2019-01-04 02:00:09
1 2019-01-04 14:00:09
2 2019-01-04 14:00:18
print (df.dtypes)
Timestamp1 datetime64[ns]
dtype: object
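pd.datetime is removed in newer pandas versions; a sketch of the same parse without the custom function (same format string as above):
df = pd.read_csv('filename.csv')
df['Timestamp1'] = pd.to_datetime(df['Timestamp1'], format='%m/%d/%Y %I:%M:%S %p')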
EDIT2: Reading the actual file shows the seconds are already missing in the data itself (see edit 1), so no parser can recover them:
df = pd.read_csv('send1.csv', parse_dates=['Timestamp'])
print (df)
Timestamp
0 2019-01-04 14:00:00
1 2019-01-04 14:00:00
2 2019-01-04 14:00:00
3 2019-01-04 14:00:00
4 2019-01-04 14:00:00
5 2019-01-04 14:00:00
