I have a pandas dataframe with a column called timeElapsed, measured in seconds.
I take input from the user to get a specific timestamp.
I want to add each timeElapsed value to this timestamp.
For example:
user enters: 2021-07-08 10:00:00.0000
First entry in timeElapsed column is 80.1234.
New Column should be 2021-07-08 10:01:20.1234
So far, this is my code:
import time
import pandas as pd
from datetime import datetime
df1 = pd.DataFrame({'userData': [1, 2, 3, 4, 5, 6, 7],
                    'timeElapsed': [0, 1.6427, 2.5185, 5.3293, 6.6699, 37.4221, 67.4378]})
takeDateInput = str(datetime.strptime(input("Enter current timestamp: YYYY-MM-DD HH:MM:SS.MS"),'%Y-%m-%d %H:%M:%S.%f'))
def myfunc2(x):
    time.gmtime(x)
print(df1['timeElapsed'].apply(myfunc2))
I am trying to convert the seconds value into a formatted hh:mm:ss timestamp using myfunc2, but the conversion does not work. Is this the correct approach?
Any direction on how to achieve my final goal would be much appreciated. Thank you.
The timeElapsed value you're trying to add is best represented as a Timedelta. Keep the entered timestamp as a datetime object (not a string); then you can simply add the seconds as a Timedelta:
takeDateInput = datetime.strptime(input("Enter current timestamp: YYYY-MM-DD HH:MM:SS.MS"),'%Y-%m-%d %H:%M:%S.%f')
def myfunc2(x):
    return takeDateInput + pd.Timedelta(x, unit='sec')
print(df1['timeElapsed'].apply(myfunc2))
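The same addition can also be done on the whole column at once, without apply. This is a minimal sketch, assuming a hard-coded start time in place of input() and an illustrative column name `timestamp` that is not in the question:

```python
import pandas as pd

df1 = pd.DataFrame({'userData': [1, 2, 3],
                    'timeElapsed': [0, 1.6427, 80.1234]})

# Assumption: a fixed start time standing in for the user's input.
start = pd.Timestamp('2021-07-08 10:00:00')

# Convert the whole column of seconds to Timedelta values in one call,
# then add the start time to every row.
df1['timestamp'] = start + pd.to_timedelta(df1['timeElapsed'], unit='s')
print(df1)
```

The result is a datetime64 column, so it can be formatted or subtracted later without further conversion.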
I have a sample dataframe as given below.
import pandas as pd
import numpy as np
data = {'InsertedDate': ['2022-01-21 20:13:19.000000', '2022-01-21 20:20:24.000000',
                         '2022-02-02 16:01:49.000000', '2022-02-09 15:01:31.000000'],
        'UTCOffset': ['-05:00', '+02:00', '-04:00', '+06:00']}
df = pd.DataFrame(data)
df['InsertedDate'] = pd.to_datetime(df['InsertedDate'])
df
'InsertedDate' is a datetime column, whereas 'UTCOffset' is a string column.
I want to add the offset to the 'InsertedDate' column and store the result in a new datetime column.
It should look something like this image shown below.
Any help is greatly appreciated. Thank you!
You can use pd.to_timedelta to convert the offset and then add it to the timestamp.
# to_timedelta needs the [+-]HH:MM:SS format, so append ':00' to fill the :SS part.
df['UTCOffset'] = pd.to_timedelta(df.UTCOffset + ':00')
df['CorrectTime'] = df.InsertedDate + df.UTCOffset
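For reference, here is a self-contained sketch of the same idea on two of the sample rows. It keeps the original UTCOffset strings untouched by storing the parsed offsets in a separate variable (`offset` is just an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'InsertedDate': pd.to_datetime(['2022-01-21 20:13:19', '2022-01-21 20:20:24']),
    'UTCOffset': ['-05:00', '+02:00'],
})

# Append ':00' so each offset has the [+-]HH:MM:SS shape, then parse it.
offset = pd.to_timedelta(df['UTCOffset'] + ':00')

# Adding a negative Timedelta moves the timestamp back in time.
df['CorrectTime'] = df['InsertedDate'] + offset
print(df)
```

With these inputs the first row moves back five hours and the second moves forward two hours.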
I'm trying to format my timestamp column to include milliseconds, without success. How can I format my time to look like this: 2019-01-04 11:09:21.152?
I have looked at the documentation and followed SimpleDateFormat, which the PySpark docs say is used by the to_timestamp function.
This is my dataframe.
+--------------------------+
|updated_date |
+--------------------------+
|2019-01-04 11:09:21.152815|
+--------------------------+
I used the millisecond format as shown below, without success:
>>> df.select('updated_date').withColumn("updated_date_col2",
to_timestamp("updated_date", "YYYY-MM-dd HH:mm:ss:SSS")).show(1,False)
+--------------------------+-------------------+
|updated_date |updated_date_col2 |
+--------------------------+-------------------+
|2019-01-04 11:09:21.152815|2019-01-04 11:09:21|
+--------------------------+-------------------+
I expect updated_date_col2 to be formatted as 2019-01-04 11:09:21.152
I think you can use a UDF and Python's standard datetime module, as below.
import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType
def _to_timestamp(s):
    return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')
udf_to_timestamp = udf(_to_timestamp, TimestampType())
df.select('updated_date').withColumn("updated_date_col2", udf_to_timestamp("updated_date")).show(1,False)
This is not a solution based on to_timestamp, but it lets you keep the column as a timestamp type.
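If the goal is only the displayed text 2019-01-04 11:09:21.152, plain Python can produce it. The sketch below assumes the input is a string with six fractional digits; `to_millis_string` is a hypothetical helper name, and wrapping it in a UDF that returns a string is left out here:

```python
from datetime import datetime

def to_millis_string(s):
    """Parse a microsecond timestamp string and render it with milliseconds."""
    parsed = datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')
    # %f always renders six digits; dropping the last three truncates
    # (does not round) to milliseconds.
    return parsed.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]

print(to_millis_string('2019-01-04 11:09:21.152815'))
```

Note that the result is a string, not a timestamp, so it suits display rather than further date arithmetic.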
The following code is an example of converting a numeric epoch value to a timestamp.
from datetime import datetime
ms = datetime.now().timestamp() # epoch seconds as a float, e.g. ms = 1547521021.83301
df = spark.createDataFrame([(1, ms)], ['obs', 'time'])
df = df.withColumn('time', df.time.cast("timestamp"))
df.show(1, False)
+---+--------------------------+
|obs|time |
+---+--------------------------+
|1 |2019-01-15 12:15:49.565263|
+---+--------------------------+
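In JS, new Date().getTime() or Date.now() returns the epoch time in milliseconds, whereas datetime.datetime.now().timestamp() in Python returns it in seconds as a float. The cast to timestamp above interprets the number as seconds, so divide millisecond values by 1000 first.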
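To make the unit difference concrete, here is a small plain-Python sketch. The sample value and the helper name `ms_to_seconds` are made up for illustration:

```python
from datetime import datetime, timezone

def ms_to_seconds(epoch_ms):
    """Convert epoch milliseconds (JavaScript style) to epoch seconds."""
    return epoch_ms / 1000.0

js_style_ms = 1547521021833           # the kind of value Date.now() returns
seconds = ms_to_seconds(js_style_ms)  # same unit as datetime.timestamp()

# Interpret the seconds value as a UTC datetime to check the conversion.
moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(seconds, moment)
```

Passing the millisecond value through unconverted would be read as seconds and land tens of thousands of years in the future.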
The reason is that PySpark's to_timestamp parses only to second precision, while TimestampType is able to hold milliseconds.
The following workaround may work:
If the timestamp pattern contains S, invoke a UDF to get the string 'INTERVAL MILLISECONDS' to use in an expression.
ts_pattern = "YYYY-MM-dd HH:mm:ss:SSS"
my_col_name = "time_with_ms"
# get the time up to seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"], ts_pattern))
# add the milliseconds as an interval
if 'S' in ts_pattern:
    df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))
To get INTERVAL 256 MILLISECONDS, we may use a Java UDF:
df = df.withColumn(my_col_name, df[my_col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))
Inside the UDF getIntervalStringUDF(String timeString, String pattern):
Use SimpleDateFormat to parse the date according to pattern
Return the formatted date as a string using the pattern "'INTERVAL 'SSS' MILLISECONDS'"
Return 'INTERVAL 0 MILLISECONDS' on parse/format exceptions
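The steps listed for the UDF can be sketched in plain Python as a stand-in for the Java version. This is only an illustration: `get_interval_string` is a hypothetical name, and it uses Python's strptime format codes rather than SimpleDateFormat patterns:

```python
from datetime import datetime

def get_interval_string(time_string, fmt='%Y-%m-%d %H:%M:%S.%f'):
    """Return 'INTERVAL <n> MILLISECONDS' for the fractional part of time_string."""
    try:
        parsed = datetime.strptime(time_string, fmt)
    except (ValueError, TypeError):
        # Mirror the fallback described above for unparseable input.
        return 'INTERVAL 0 MILLISECONDS'
    # microsecond // 1000 truncates to whole milliseconds.
    return 'INTERVAL {} MILLISECONDS'.format(parsed.microsecond // 1000)

print(get_interval_string('2019-01-04 11:09:21.152815'))
print(get_interval_string('not a timestamp'))
```

Falling back to a zero interval keeps the second-precision timestamp unchanged when a row cannot be parsed.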
I have a pandas dataframe with columns containing start and stop times in this format: 2016-01-01 00:00:00
I would like to convert these times to datetime objects so that I can subtract one from the other to compute total duration. I'm using the following:
import datetime
df = df['start_time'] = df['start_time'].apply(lambda x: datetime.datetime.strptime(x, '%Y/%m/%d/%T %I:%M:%S %p'))
However, I have the following ValueError:
ValueError: 'T' is a bad directive in format '%Y/%m/%d/%T %I:%M:%S %p'
The following converts the column to the datetime64 dtype; you can then do whatever processing you need with that column.
df['start_time'] = pd.to_datetime(df['start_time'], format="%Y-%m-%d %H:%M:%S")
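If you want to avoid specifying the datetime format explicitly, you can use the following (note that infer_datetime_format is deprecated as of pandas 2.0, where the format is inferred by default):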
df['start_time'] = pd.to_datetime(df['start_time'], infer_datetime_format=True)
The simplest way is to use to_datetime:
df['start_time'] = pd.to_datetime(df['start_time'])
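Once both columns are datetime64, the duration is a plain subtraction. A minimal sketch with made-up sample rows, assuming the stop column is called `stop_time` (the question does not give its name):

```python
import pandas as pd

# Assumption: 'stop_time' is a hypothetical name for the stop column.
df = pd.DataFrame({
    'start_time': ['2016-01-01 00:00:00', '2016-01-01 01:00:00'],
    'stop_time':  ['2016-01-01 00:30:00', '2016-01-01 03:00:00'],
})

for col in ['start_time', 'stop_time']:
    df[col] = pd.to_datetime(df[col], format='%Y-%m-%d %H:%M:%S')

# Subtracting two datetime64 columns yields a timedelta64 column.
df['duration'] = df['stop_time'] - df['start_time']
total = df['duration'].sum()
print(total)
```

Here `total` is a single Timedelta covering all rows; `total.total_seconds()` gives it as a float if a number is needed.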
Given Timestamp indices with many entries per day, how can I get a list containing only the last Timestamp of each day? For example, given:
import pandas as pd
all = [pd.Timestamp('2016-05-01 10:23:45'),
pd.Timestamp('2016-05-01 18:56:34'),
pd.Timestamp('2016-05-01 23:56:37'),
pd.Timestamp('2016-05-02 03:54:24'),
pd.Timestamp('2016-05-02 14:32:45'),
pd.Timestamp('2016-05-02 15:38:55')]
I would like to get:
# End of Day:
EoD = [pd.Timestamp('2016-05-01 23:56:37'),
pd.Timestamp('2016-05-02 15:38:55')]
Thx in advance!
Try pandas groupby
all = pd.Series(all)
all.groupby([all.dt.year, all.dt.month, all.dt.day]).max()
You get
2016  5  1   2016-05-01 23:56:37
         2   2016-05-02 15:38:55
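A possible variation groups on a single key per calendar day and returns a plain list, which is what the question asks for. This sketch uses the variable name `stamps` instead of `all` to avoid shadowing the Python builtin:

```python
import pandas as pd

stamps = pd.Series([pd.Timestamp('2016-05-01 10:23:45'),
                    pd.Timestamp('2016-05-01 18:56:34'),
                    pd.Timestamp('2016-05-01 23:56:37'),
                    pd.Timestamp('2016-05-02 03:54:24'),
                    pd.Timestamp('2016-05-02 14:32:45'),
                    pd.Timestamp('2016-05-02 15:38:55')])

# normalize() sets each time to midnight, giving one key per calendar day.
eod = stamps.groupby(stamps.dt.normalize()).max().tolist()
print(eod)
```

Because max() is used, the input does not need to be sorted beforehand.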
I've created an example dataframe.
import pandas as pd
all = [pd.Timestamp('2016-05-01 10:23:45'),
pd.Timestamp('2016-05-01 18:56:34'),
pd.Timestamp('2016-05-01 23:56:37'),
pd.Timestamp('2016-05-02 03:54:24'),
pd.Timestamp('2016-05-02 14:32:45'),
pd.Timestamp('2016-05-02 15:38:55')]
df = pd.DataFrame({'values':0}, index = all)
Assuming your data frame is structured like the example and, most importantly, is sorted by index, the code below should help.
for date in set(df.index.date):
    print(df[df.index.date == date].iloc[-1, :])
For each unique date in your dataframe, this code returns the last row of that date's slice, so with a sorted index it returns the last record of the day. And hey, it's pythonic. (I believe so, at least.)
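Iterating over a set gives the dates in no guaranteed order, so here is a sketch of an alternative that keeps the rows in chronological order. It is a different technique from the loop above (groupby plus tail), and the names `stamps` and `last_rows` are only illustrative:

```python
import pandas as pd

stamps = [pd.Timestamp('2016-05-01 10:23:45'),
          pd.Timestamp('2016-05-01 18:56:34'),
          pd.Timestamp('2016-05-01 23:56:37'),
          pd.Timestamp('2016-05-02 03:54:24'),
          pd.Timestamp('2016-05-02 14:32:45'),
          pd.Timestamp('2016-05-02 15:38:55')]
df = pd.DataFrame({'values': range(len(stamps))}, index=stamps)

# Sort first so that "last row per group" means "latest timestamp that day".
df = df.sort_index()

# Group by calendar day and keep the final row of each group.
last_rows = df.groupby(df.index.normalize()).tail(1)
print(last_rows)
```

The result is a DataFrame with one row per day, so the other columns of the last record stay available.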
I have microsecond resolution in my df, which is very important, but no matter what I try, I can't get Excel to show microsecond resolution with either .xls or .xlsx. Any ideas on how to get it to display without explicitly converting to a string?
With the latest version of Pandas on GitHub (and in the soon to be released 0.13.1) you can specify the Excel date format in the ExcelWriter() like this:
import pandas as pd
from datetime import datetime
df = pd.DataFrame([datetime(2014, 2, 1, 12, 30, 5, 60000)])
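# datetime_format applies to datetime values; date_format only affects date objects.
writer = pd.ExcelWriter("time.xlsx", datetime_format='hh:mm:ss.000')
df.to_excel(writer, sheet_name="Sheet1")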
writer.close()
This will display the times with sub-second digits (milliseconds at most, since that is Excel's display limit).
See also Working with Python Pandas and XlsxWriter.
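Since the display stops at milliseconds, one possible workaround (not part of the answer above) is to carry the sub-second part in a separate integer column before writing. A minimal sketch, where `ts` and `ts_microsecond` are hypothetical column names:

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'ts': [datetime(2014, 2, 1, 12, 30, 5, 60000),
                          datetime(2014, 2, 1, 12, 30, 5, 60123)]})

# Assumption: 'ts_microsecond' is a hypothetical helper column name.
# It holds the sub-second part as an integer, so no digits are lost
# to the display format.
df['ts_microsecond'] = df['ts'].dt.microsecond
print(df)
```

This avoids turning the timestamps into strings, at the cost of an extra column in the sheet.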