I have a dataframe with timestamp values and I want to round each timestamp up to the next minute, but I am not getting the desired output.
I have used the following link to do the same:
[1]: http://stackoverflow.com/questions/32344533/how-do-i-round-datetime-column-to-nearest-quarter-hour/32344636
How can I do that using pandas?
Example (input -> output):
05:06:34 -> 05:07
05:09:43 -> 05:10
You can write a function that will round up to the nearest minute:
Code:
import datetime as dt

def round_up_to_minute(timestamp):
    # Add the seconds remaining until the next minute and cancel
    # out the microseconds so the result lands exactly on a minute
    return timestamp + dt.timedelta(
        seconds=60 - timestamp.second,
        microseconds=-timestamp.microsecond)
Test Code:
To apply it to a dataframe, you can do something like:

import pandas as pd

now = dt.datetime.now()
now_plus = now + dt.timedelta(minutes=1)
df = pd.DataFrame(data=[[now, now], [now_plus, now_plus]],
                  columns=['datetime', 'datetime_rounded'])
df['datetime_rounded'] = df['datetime_rounded'].apply(round_up_to_minute)
print(df)
Results:
datetime datetime_rounded
0 2017-03-06 07:36:54.794 2017-03-06 07:37:00
1 2017-03-06 07:37:54.794 2017-03-06 07:38:00
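One subtlety not called out above: because the helper always adds `60 - timestamp.second` seconds, a timestamp already sitting exactly on a minute boundary is pushed a full minute forward. A quick sketch (helper repeated so the snippet is self-contained):

```python
import datetime as dt

def round_up_to_minute(timestamp):
    # Same helper as above: add the seconds left until the next
    # minute, then cancel out the microseconds
    return timestamp + dt.timedelta(
        seconds=60 - timestamp.second,
        microseconds=-timestamp.microsecond)

# 05:06:34 rounds up to 05:07:00, as expected
print(round_up_to_minute(dt.datetime(2023, 1, 1, 5, 6, 34)))  # 2023-01-01 05:07:00

# An exact minute is also pushed forward, to 05:07:00
print(round_up_to_minute(dt.datetime(2023, 1, 1, 5, 6, 0)))   # 2023-01-01 05:07:00
```

If already-exact minutes should be left unchanged, pandas' ceil('1min') behaves that way instead.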
More recent versions of pandas provide DatetimeIndex.ceil(), which does exactly this:
dt_index = pd.to_datetime(['1970-01-01 05:06:34','1970-01-01 05:09:43'])
print(dt_index)
print(dt_index.ceil('1min'))
Outputs
DatetimeIndex(['1970-01-01 05:06:34', '1970-01-01 05:09:43'], dtype='datetime64[ns]', freq=None)
DatetimeIndex(['1970-01-01 05:07:00', '1970-01-01 05:10:00'], dtype='datetime64[ns]', freq=None)
DatetimeIndex also has floor() and round().
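For a dataframe column, as in the original question, the same methods are available through the .dt accessor; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'ts': pd.to_datetime(['1970-01-01 05:06:34',
                                         '1970-01-01 05:09:43'])})
# Series.dt exposes the same ceil/floor/round methods as DatetimeIndex
df['ts_rounded'] = df['ts'].dt.ceil('1min')
print(df['ts_rounded'].tolist())
# [Timestamp('1970-01-01 05:07:00'), Timestamp('1970-01-01 05:10:00')]
```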
I'm trying to generate multiple random datetime stamps between two dates. I tried the code below, based on an existing post and question, but it generates only a single random datetime.
import datetime
import random
import pandas as pd
min_date = pd.to_datetime('2019-01-01 00:00:00')
max_date = pd.to_datetime('2019-01-01 23:59:59')
min_date + datetime.timedelta(
    seconds=random.randint(0, int((max_date - min_date).total_seconds())))
>>> Timestamp('2019-01-01 05:58:40')
Is there a way to generate multiple datetimes based on a given size? Say, if the size is 100, it should generate 100 random timestamp objects similar to the output above. I also want to store the 100 timestamps in a pandas dataframe.
Try this:

import datetime
import random
import pandas as pd

min_date = pd.to_datetime('2019-01-01 00:00:00')
max_date = pd.to_datetime('2019-01-01 23:59:59')

for x in range(100):
    print(min_date + datetime.timedelta(
        seconds=random.randint(0, int((max_date - min_date).total_seconds()))))

It will print 100 random timestamps.
N = 100
# difference between start & end in seconds; add 1 because range() is end-exclusive
diff = (max_date - min_date).total_seconds() + 1
# sample N seconds from 0 to diff, as a TimedeltaIndex for vectorized addability
offsets = random.sample(range(int(diff)), k=N)
# add those offsets to the starting date
result = min_date + pd.TimedeltaIndex(offsets, unit="s")
Example run:
In [60]: N = 10
...: diff = (max_date - min_date).total_seconds() + 1
...: offsets = random.sample(range(int(diff)), k=N)
...: result = min_date + pd.TimedeltaIndex(offsets, unit="s")
In [61]: result
Out[61]:
DatetimeIndex(['2019-01-01 16:30:47', '2019-01-01 00:05:32',
'2019-01-01 02:35:15', '2019-01-01 21:25:09',
'2019-01-01 19:09:26', '2019-01-01 06:25:37',
'2019-01-01 07:28:47', '2019-01-01 00:25:18',
'2019-01-01 17:04:58', '2019-01-01 05:15:46'],
dtype='datetime64[ns]', freq=None)
The result is a DatetimeIndex, but calling .tolist() on it gives a plain list of Timestamps if so desired.
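The question also asks to store the timestamps in a pandas dataframe; the sampled index can be handed straight to the DataFrame constructor. A sketch (using pd.to_timedelta, which likewise produces a TimedeltaIndex):

```python
import random
import pandas as pd

min_date = pd.to_datetime('2019-01-01 00:00:00')
max_date = pd.to_datetime('2019-01-01 23:59:59')

N = 100
diff = (max_date - min_date).total_seconds() + 1
# unique second offsets, converted to a TimedeltaIndex
offsets = pd.to_timedelta(random.sample(range(int(diff)), k=N), unit="s")
result = min_date + offsets

# A DatetimeIndex is a valid column input for the DataFrame constructor
df = pd.DataFrame({'timestamp': result})
print(len(df))  # 100
```

Because random.sample draws without replacement, the 100 timestamps are also guaranteed to be distinct.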
I have been exploring the methods and properties of pd.Timestamp and pd.DatetimeIndex, but so far have not been able to find a way to get the time zone that pandas is adopting, like 'US/Eastern' on a US-locale system.
One would assume that pandas would adopt the time zone specified in the system locale when converting a datetime string like '2022-03-03 17:15:00' into an epoch value.
We could find timezone information using the time module:
time.tzname => ('EST','EDT')
I am wondering: in pandas, how do we get the default time zone it is adopting?
I believe it's naive and no time zone is assumed.
You can of course specify utc=True and then convert to a specific tz.
import pandas as pd
data = {
    "START_TIME": ["2022-06-27 09:30:19", "2022-08-20 11:55:25"],
    "STOP_TIME": ["2022-06-27 12:30:00", "2022-08-20 13:00:00"],
}
df = pd.DataFrame(data)
print(df)
START_TIME STOP_TIME
0 2022-06-27 09:30:19 2022-06-27 12:30:00
1 2022-08-20 11:55:25 2022-08-20 13:00:00
for column in df.columns[df.columns.str.contains("time", case=False)]:
    df[column] = (
        pd.to_datetime(df[column], utc=True)
        .dt.tz_convert("America/New_York")
    )
print(df)
START_TIME STOP_TIME
0 2022-06-27 05:30:19-04:00 2022-06-27 08:30:00-04:00
1 2022-08-20 07:55:25-04:00 2022-08-20 09:00:00-04:00
If you know the data is already in a specific tz but it is naive you can also make it tz aware.
for column in df.columns[df.columns.str.contains("time", case=False)]:
    df[column] = (
        pd.to_datetime(df[column], utc=False)
        .dt.tz_localize("America/New_York")
    )
print(df)
START_TIME STOP_TIME
0 2022-06-27 09:30:19-04:00 2022-06-27 12:30:00-04:00
1 2022-08-20 11:55:25-04:00 2022-08-20 13:00:00-04:00
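To illustrate the point about naivety: a timestamp parsed without explicit zone information carries no tz attribute at all, regardless of the system locale. A quick check:

```python
import pandas as pd

# Parsed without zone info: tz is simply None (naive)
ts = pd.to_datetime('2022-03-03 17:15:00')
print(ts.tz)  # None

# A time zone only appears once you pass utc=True or localize explicitly
aware = pd.to_datetime('2022-03-03 17:15:00', utc=True)
print(aware.tz)  # UTC
```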
I have a Pandas df of Stock Tickers with specific dates, I want to add the adjusted close for that date using yahoo finance. I iterate through the dataframe, do the yahoo call for that Ticker and Date, and the correct information is returned. However, I am not able to add that information back to the original df. I have tried various loc, iloc, and join methods, and none of them are working for me. The df shows the initialized zero values instead of the new value.
import pandas as pd
import yfinance as yf
from datetime import timedelta
# Build the dataframe
df = pd.DataFrame({'Ticker': ['BGFV', 'META', 'WIRE', 'UG'],
                   'Date': ['5/18/2021', '5/18/2021', '4/12/2022', '6/3/2019'],
                   })
# Change the Date to Datetime
df['Date'] = pd.to_datetime(df.Date)
# initialize the adjusted close
df['Adj_Close'] = 0.00 # You'll get a column of all 0s
# iterate through the rows of the df and retrieve the Adjusted Close from Yahoo
for i in range(len(df)):
    ticker = df.iloc[i]['Ticker']
    start = df.iloc[i]['Date']
    end = start + timedelta(days=1)
    # YF call
    data = yf.download(ticker, start=start, end=end)
    # Get just the adjusted close
    adj_close = data['Adj Close']
    # Write the adjusted close to the dataframe on the correct row
    df.iloc[i]['Adj_Close'] = adj_close
    print(f'i value is {i} and adjusted close value is {adj_close} \n')
print(df)
The simplest way is to use loc, as below:
# change this line
df.loc[i,'Adj_Close'] = adj_close.values[0]
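For completeness, the reason the original loop left zeros behind: df.iloc[i] returns a copy of the row, so assigning into that copy never touches the frame. A minimal reproduction with made-up values (no yfinance call needed):

```python
import pandas as pd

df = pd.DataFrame({'Ticker': ['BGFV', 'META'], 'Adj_Close': [0.0, 0.0]})

# df.iloc[0] hands back a row copy, so assigning into it leaves df untouched
row = df.iloc[0]
row['Adj_Close'] = 27.81
print(df['Adj_Close'].tolist())  # [0.0, 0.0]

# A single .loc indexer writes into the frame itself
df.loc[0, 'Adj_Close'] = 27.81
print(df['Adj_Close'].tolist())  # [27.81, 0.0]
```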
You can use:
def get_adj_close(x):
    # You needn't specify the end param because period is already set to 1 day
    data = yf.download(x['Ticker'], start=x['Date'], progress=False)
    return data['Adj Close'].iloc[0].squeeze()
df['Adj_Close'] = df.apply(get_adj_close, axis=1)
Output:
>>> df
Ticker Date Adj_Close
0 BGFV 2021-05-18 27.808811
1 META 2021-05-18 315.459991
2 WIRE 2022-04-12 104.320045
3 UG 2019-06-03 16.746983
The problem is that the dataset has variable data rates per ID; I would like to filter out the IDs that do not have at least one data point per day.
I have a dataframe with IDs, dates, and data, in which I counted the daily sampling rate for each ID.
dfcounted = df.reset_index().groupby(['id', pd.Grouper(key='datetime', freq='D')]).count().reset_index()
Now, I have taken the first and last dates of the dataframe and created a dataframe containing each day between the starting and ending dates:
# take dates
sdate = df['datetime'].min() # start date
edate = df['datetime'].max() # end date
# interval
delta = edate - sdate # as timedelta
# empty list
dates = []
# store each date in list
for i in range(delta.days + 1):
    day = sdate + timedelta(days=i)
    dates.append(day)
# convert to dataframe
dates = pd.DataFrame(data = dates, columns=["date"])
From here, I am lost on how to proceed. I have created a sample dataframe
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
import string
letters = string.ascii_lowercase
ids = random.choices(letters,k=100)
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(99), freq='D')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'date': days,'ids': ids, 'data': data})
df = df.set_index('date')
With the sample df, I would expect to create a "results" df containing only the ids that have data on every date.
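The thread leaves this open; one plausible way to proceed (my sketch, not from the original thread) is to count distinct calendar days per id and keep only the ids whose count matches the full span of days. Assuming columns named id and datetime, as in the first snippet:

```python
import pandas as pd

# Toy data: id 'a' has a reading every day, id 'b' skips Jan 2
df = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b'],
    'datetime': pd.to_datetime(['2021-01-01 03:00', '2021-01-02 09:00',
                                '2021-01-03 12:00', '2021-01-01 05:00',
                                '2021-01-03 18:00']),
})

# Number of calendar days between the first and last observation
n_days = (df['datetime'].max().normalize()
          - df['datetime'].min().normalize()).days + 1

# Distinct days actually observed per id
days_per_id = df.groupby('id')['datetime'].apply(
    lambda s: s.dt.normalize().nunique())

# Keep only the ids that cover every day in the range
complete_ids = days_per_id[days_per_id == n_days].index
result = df[df['id'].isin(complete_ids)]
print(result['id'].unique().tolist())  # ['a']
```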
I have a pandas dataframe that looks like:
import pandas as pd
df1 = pd.DataFrame({'Counterparty': ['Bank', 'Client', 'Bank', 'Bank', 'Bank', 'Bank'],
                    'Date': ['4Q18', '1Q19', '2Q19', '4Q21', 'FY22', 'H123']
                    })
I want to convert the 'Date' column from a string to a date such that the date is the last date of that particular period, i.e. '4Q18' = 31 Dec 2018, '1Q19' = 31 Mar 2019, 'FY22' = 31 Dec 2022, 'H123' = 30 June 2023.
Any suggestions how to achieve this ?
As mentioned by @jpp, you're going to have to do some customization. There isn't existing functionality to map "FY22" to 2022-12-31, to my knowledge. Here's something to get you started, based on the limited example you've shown:
import re
from datetime import datetime

import pandas as pd
from pandas.tseries import offsets

halfyr = re.compile(r'H(?P<half>\d)(?P<year>\d{2})')
fiscalyr = re.compile(r'FY(?P<year>\d{2})')

def try_qend(date):
    try:
        # Quarter strings such as '4Q18' parse directly; snap to quarter end
        return pd.to_datetime(date) + offsets.QuarterEnd()
    except ValueError:  # pandas' DateParseError is a ValueError subclass
        halfyr_match = halfyr.match(date)
        if halfyr_match:
            half, year = [int(i) for i in halfyr_match.groups()]
            month = 6 if half == 1 else 12
            return datetime(2000 + year, month, 1) + offsets.MonthEnd()
        fiscalyr_match = fiscalyr.match(date)
        if fiscalyr_match:
            year = int(fiscalyr_match.group('year'))
            return datetime(2000 + year, 12, 31)
        # You're SOL
        return pd.NaT

def parse_dates(dates):
    return pd.to_datetime([try_qend(date) for date in dates])
Assumptions:
All years are 20yy, not 19xx.
The regex patterns here completely describe the year-half/fiscal-year syntax set.
Example:
dates = ['4Q18','1Q19','2Q19','4Q21','FY22','H123']
parse_dates(dates)
DatetimeIndex(['2018-12-31', '2019-03-31', '2019-06-30', '2021-12-31',
'2022-12-31', '2023-06-30'],
dtype='datetime64[ns]', freq=None)