Convert datetime object to date and datetime2 to time then combine to single column - python-3.x

I have a dataset where the transaction date is stored as YYYY-MM-DD 00:00:00 and the transaction time is stored as 1900-01-01 HH:MM:SS.
I need to truncate these timestamps and then either leave them as-is or combine them into a single timestamp. I've tried several methods, and all of them keep returning the full timestamp. Thoughts?

Use split and pd.to_datetime:
import pandas as pd

df = pd.DataFrame({'TransDate': ['2015-01-01 00:00:00', '2015-01-02 00:00:00', '2015-01-03 00:00:00'],
                   'TransTime': ['1900-01-01 07:00:00', '1900-01-01 08:30:00', '1900-01-01 09:45:15']})
df['Date'] = (pd.to_datetime(df['TransDate'].str.split().str[0] +
                             ' ' +
                             df['TransTime'].str.split().str[1]))
Output:
             TransDate            TransTime                Date
0  2015-01-01 00:00:00  1900-01-01 07:00:00 2015-01-01 07:00:00
1  2015-01-02 00:00:00  1900-01-01 08:30:00 2015-01-02 08:30:00
2  2015-01-03 00:00:00  1900-01-01 09:45:15 2015-01-03 09:45:15
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
TransDate 3 non-null object
TransTime 3 non-null object
Date 3 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(2)
memory usage: 152.0+ bytes
None
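If the two columns have already been parsed as datetime64 (for example by read_csv with parse_dates) rather than stored as strings, a minimal sketch that avoids string splitting is to add the time-of-day offset to the date directly:

```python
import pandas as pd

# Assumed starting point: both columns are already datetime64, not strings.
df = pd.DataFrame({'TransDate': pd.to_datetime(['2015-01-01', '2015-01-02']),
                   'TransTime': pd.to_datetime(['1900-01-01 07:00:00',
                                                '1900-01-01 08:30:00'])})

# Subtracting midnight of the TransTime day leaves a pure time-of-day
# Timedelta, which can be added straight onto the date.
df['Date'] = df['TransDate'] + (df['TransTime'] - df['TransTime'].dt.normalize())
print(df['Date'])
```

This stays entirely in datetime arithmetic, so no round trip through strings is needed.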


Pandas strptime returns a string instead of DateTime object [duplicate]

I have the following pandas DataFrame:
data = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                     "end_time": ["2016-01-13", "2016-01-01", "2016-11-12", "2016-01-17", "2016-03-13"]})
I want to transform the end_time column to a column of datetime objects. But when I do it like this (like it is suggested everywhere):
data["end"] = data["end_time"].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))
the output is still a string column:
   id    end_time         end
0   1  2016-01-13  2016-01-13
1   2  2016-01-01  2016-01-01
2   3  2016-11-12  2016-11-12
3   4  2016-01-17  2016-01-17
4   5  2016-03-13  2016-03-13
How to solve this?
strftime is designed to return a string object (see the documentation), so check that you are calling strptime and not strftime.
If we want to convert end_time to datetime64[ns] and assign it to a new column named end, we can use:
data['end'] = pd.to_datetime(data.end_time)
strptime will also convert the strings, and pandas stores the result as datetime64[ns], but the to_datetime method is preferable:
data["end"] = data["end_time"].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"))
Output
   id    end_time         end
0   1  2016-01-13  2016-01-13
1   2  2016-01-01  2016-01-01
2   3  2016-11-12  2016-11-12
3   4  2016-01-17  2016-01-17
4   5  2016-03-13  2016-03-13
Datatypes:
data.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   id        5 non-null      int64
 1   end_time  5 non-null      object
 2   end       5 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 248.0+ bytes
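As a small sketch combining the two approaches above, passing an explicit format to pd.to_datetime keeps the strict parsing that strptime offers while staying vectorized:

```python
import pandas as pd

data = pd.DataFrame({"id": [1, 2],
                     "end_time": ["2016-01-13", "2016-01-01"]})

# An explicit format avoids per-row Python-level parsing and fails loudly
# if a value does not match, instead of guessing.
data["end"] = pd.to_datetime(data["end_time"], format="%Y-%m-%d")
print(data["end"].dtype)  # datetime64[ns]
```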

How to get the minimum time value in a dataframe with excluding specific value

I have a dataframe in the format below. I want to get the minimum time value of each column and save the values in a list, while excluding the placeholder value 00:00:00 from ever being chosen as the minimum of any column.
df =
  10.0.0.155 192.168.1.240 192.168.0.242
0   19:48:46      16:23:40      20:14:07
1   20:15:46      16:23:39      20:14:09
2   19:49:37      16:23:20      00:00:00
3   20:15:08      00:00:00      00:00:00
4   19:48:46      00:00:00      00:00:00
5   19:47:30      00:00:00      00:00:00
6   19:49:13      00:00:00      00:00:00
7   20:15:50      00:00:00      00:00:00
8   19:45:34      00:00:00      00:00:00
9   19:45:33      00:00:00      00:00:00
I tried to use the code below, but it doesn't work:
minValues = []
for column in df:
    #print(df[column])
    if "00:00:00" in df[column]:
        minValues.append(df[column].nlargest(2).iloc[-1])
    else:
        minValues.append(df[column].min())
print(df)
print(minValues)
The idea is to replace the zero values with missing values and then get the minimal timedeltas:
df1 = df.astype(str).apply(pd.to_timedelta)
s1 = df1.mask(df1.eq(pd.Timedelta(0))).min()
print (s1)
10.0.0.155 0 days 19:45:33
192.168.1.240 0 days 16:23:20
192.168.0.242 0 days 20:14:07
dtype: timedelta64[ns]
Or get the minimal datetimes and then convert the output to HH:MM:SS values:
df1 = df.astype(str).apply(pd.to_datetime)
s2 = df1.mask(df1.eq(pd.to_datetime("00:00:00"))).min().dt.strftime('%H:%M:%S')
print (s2)
10.0.0.155 19:45:33
192.168.1.240 16:23:20
192.168.0.242 20:14:07
dtype: object
Or to times:
df1 = df.astype(str).apply(pd.to_datetime)
s3 = df1.mask(df1.eq(pd.to_datetime("00:00:00"))).min().dt.time
print (s3)
10.0.0.155 19:45:33
192.168.1.240 16:23:20
192.168.0.242 20:14:07
dtype: object
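Another possible sketch, assuming the placeholder is always the literal string '00:00:00', is to replace it with NaN before converting, so min skips it automatically:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'10.0.0.155': ['19:48:46', '19:45:33', '00:00:00'],
                   '192.168.1.240': ['16:23:40', '00:00:00', '16:23:20']})

# Replace the sentinel string first; to_timedelta turns NaN into NaT,
# and min() skips missing values by default.
s = df.replace('00:00:00', np.nan).apply(pd.to_timedelta).min()
print(s)
```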

Check whether a certain datetime value is missing in a given period

I have a df with DateTime index as follows:
DateTime
2017-01-02 15:00:00
2017-01-02 16:00:00
2017-01-02 18:00:00
....
....
2019-12-07 22:00:00
2019-12-07 23:00:00
Now, I want to know whether any time is missing in the 1-hour interval. For instance, one reading is missing between the 2nd and 3rd rows, since we jump from 16:00 to 18:00. Is it possible to detect this?
Create a date_range from the minimal to the maximal datetime, then filter the values with Index.isin, using ~ to invert the boolean mask:
print (df)
DateTime
0 2017-01-02 15:00:00
1 2017-01-02 16:00:00
2 2017-01-02 18:00:00
r = pd.date_range(df['DateTime'].min(), df['DateTime'].max(), freq='H')
print (r)
DatetimeIndex(['2017-01-02 15:00:00', '2017-01-02 16:00:00',
'2017-01-02 17:00:00', '2017-01-02 18:00:00'],
dtype='datetime64[ns]', freq='H')
out = r[~r.isin(df['DateTime'])]
print (out)
DatetimeIndex(['2017-01-02 17:00:00'], dtype='datetime64[ns]', freq='H')
Another idea is to create a DatetimeIndex with a helper column, change the frequency with Series.asfreq, and filter the index values that have missing values:
s = df[['DateTime']].assign(val=1).set_index('DateTime')['val'].asfreq('H')
print (s)
DateTime
2017-01-02 15:00:00 1.0
2017-01-02 16:00:00 1.0
2017-01-02 17:00:00 NaN
2017-01-02 18:00:00 1.0
Freq: H, Name: val, dtype: float64
out = s.index[s.isna()]
print (out)
DatetimeIndex(['2017-01-02 17:00:00'], dtype='datetime64[ns]', name='DateTime', freq='H')
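If only the presence of a gap matters, rather than the exact missing timestamps, a short sketch using Series.diff flags every reading that follows a gap of more than one hour:

```python
import pandas as pd

dt = pd.Series(pd.to_datetime(['2017-01-02 15:00:00',
                               '2017-01-02 16:00:00',
                               '2017-01-02 18:00:00']))

# diff() yields the step between consecutive readings (NaT for the first
# row); any step above one hour marks the reading right after a gap.
gaps = dt[dt.diff() > pd.Timedelta(hours=1)]
print(gaps)
```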
Is it safe to assume that the datetime format will always be the same? If so, why not extract the hour values from the respective timestamps and compare them to the desired interval, e.g.:
import re

# store some datetime values for show
datetimes = [
    "2017-01-02 15:00:00",
    "2017-01-02 16:00:00",
    "2017-01-02 18:00:00",
    "2019-12-07 22:00:00",
    "2019-12-07 23:00:00"
]

# extract the hour value via regex (the first match is always the hours in this format)
findHour = re.compile(r"\d{2}(?=:)")
prevx = findHour.findall(datetimes[0])[0]

# simple comparison: compare to the previous value, calculate the difference,
# then set the previous value to the current value
for x in datetimes[1:]:
    cmp = findHour.findall(x)[0]
    diff = int(cmp) - int(prevx)
    if diff > 1:
        print("Missing Timestamp(s) between {} and {} hours!".format(prevx, cmp))
    prevx = cmp

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1]: print(df_lookup)
Out[1]:
0     1.109248
1     1.102435
2     1.085014
3     1.073487
4     1.079385
5     1.088759
6     1.044708
7     0.902482
8     0.852348
9     0.995912
10    1.031643
11    1.023458
12    1.006961
...
23    0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2]: print(df)
Out[2]:
                     ID  data-1  data-2  data-3
Date_Label
2015-08-09 00:00:00   1  2513.0    2502     NaN
2015-08-09 00:00:00   1  2113.0    2102     NaN
2015-08-09 01:00:00   2  2006.0    1988     NaN
2015-08-09 02:00:00   3  2016.0    2003     NaN
...
2018-07-19 23:00:00  33  3216.0     333     NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to 'data-2' depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but it is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
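A further alternative sketch, avoiding both the row loop and the join, maps each row's hour onto the lookup values with Index.map (the small df_lookup and df below are stand-ins for the question's data):

```python
import pandas as pd

# Hypothetical stand-ins for the question's hourly lookup and data frame.
df_lookup = pd.Series({0: 1.109248, 1: 1.102435, 2: 1.085014})
idx = pd.to_datetime(['2015-08-09 00:00:00', '2015-08-09 01:00:00',
                      '2015-08-09 02:00:00'])
df = pd.DataFrame({'data-2': [2502, 1988, 2003]}, index=idx)

# Map each row's hour to its multiplier, then multiply element-wise.
df['data-3'] = df['data-2'] * pd.Index(df.index.hour).map(df_lookup).values
print(df['data-3'])
```

This keeps everything vectorized and needs no temporary helper column.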

Mixed date formats dd/mm/yyyy and d/m/y in pandas

I have a dataframe with mixed date formats in a column: some values are in the format dd/mm/yyyy and some are in the format d/m/y. How can I parse the column as datetime, applying the appropriate format depending on the value of each cell?
I am reading from a csv file:
DayofWeek,Date
Friday,22/05/2015
Friday,10/2/12
Friday,10/10/14
Friday,21/10/2011
Friday,8/7/11
df = pd.read_csv('dates.txt', parse_dates=['Date'], dayfirst=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
DayofWeek 5 non-null object
Date 5 non-null datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 100.0+ bytes
print(df)
DayofWeek Date
0 Friday 2015-05-22
1 Friday 2012-02-10
2 Friday 2014-10-10
3 Friday 2011-10-21
4 Friday 2011-07-08
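If the column has to be converted after reading, rather than through parse_dates, one version-agnostic sketch is to parse each value separately with dayfirst=True; newer pandas (2.0+) also accepts format='mixed' for this, but the per-element call below works across versions:

```python
import pandas as pd

s = pd.Series(['22/05/2015', '10/2/12', '8/7/11'])

# dayfirst=True makes the parser read d/m before m/d for each value;
# two- vs four-digit years are inferred per element. Slower than a
# vectorized call, but robust to mixed formats.
out = s.apply(lambda x: pd.to_datetime(x, dayfirst=True))
print(out)
```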
