Python - Filtering Pandas Timestamp Index - python-3.x

Given a Timestamp index with many entries per day, how can I get a list containing only the last Timestamp of each day? For example, given:
import pandas as pd
all = [pd.Timestamp('2016-05-01 10:23:45'),
pd.Timestamp('2016-05-01 18:56:34'),
pd.Timestamp('2016-05-01 23:56:37'),
pd.Timestamp('2016-05-02 03:54:24'),
pd.Timestamp('2016-05-02 14:32:45'),
pd.Timestamp('2016-05-02 15:38:55')]
I would like to get:
# End of Day:
EoD = [pd.Timestamp('2016-05-01 23:56:37'),
pd.Timestamp('2016-05-02 15:38:55')]
Thx in advance!

Try pandas groupby
all = pd.Series(all)
all.groupby([all.dt.year, all.dt.month, all.dt.day]).max()
You get
2016  5  1   2016-05-01 23:56:37
         2   2016-05-02 15:38:55
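The same grouping can also be keyed on the calendar day alone, which gives a flat result instead of a three-level index. This is a minimal sketch, assuming the timestamps are timezone-naive; the names stamps and eod are my own, chosen to avoid shadowing the built-in all.

```python
import pandas as pd

stamps = pd.Series([
    pd.Timestamp('2016-05-01 10:23:45'),
    pd.Timestamp('2016-05-01 18:56:34'),
    pd.Timestamp('2016-05-01 23:56:37'),
    pd.Timestamp('2016-05-02 03:54:24'),
    pd.Timestamp('2016-05-02 14:32:45'),
    pd.Timestamp('2016-05-02 15:38:55'),
])

# dt.normalize() resets the time to midnight, so every timestamp
# of the same day shares one group key.
eod = stamps.groupby(stamps.dt.normalize()).max().tolist()
print(eod)
```

Taking max() per group means the input does not need to be sorted beforehand.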

I've created an example dataframe.
import pandas as pd
all = [pd.Timestamp('2016-05-01 10:23:45'),
pd.Timestamp('2016-05-01 18:56:34'),
pd.Timestamp('2016-05-01 23:56:37'),
pd.Timestamp('2016-05-02 03:54:24'),
pd.Timestamp('2016-05-02 14:32:45'),
pd.Timestamp('2016-05-02 15:38:55')]
df = pd.DataFrame({'values':0}, index = all)
Assuming your DataFrame is structured like the example and, most importantly, is sorted by index, the code below should help.
for date in set(df.index.date):
    print(df[df.index.date == date].iloc[-1, :])
For each unique date in the DataFrame, this returns the last row of that day's slice, so on a sorted index you get the last record of the day. And hey, it's pythonic. (I believe so at least)
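One caveat with the loop: a set has no guaranteed order, so the days may print in any order. A possible loop-free variant is sketched below, assuming the same kind of DatetimeIndex; the name last_rows and the distinct 'values' entries are my own additions so the selected rows are easy to recognise.

```python
import pandas as pd

stamps = [
    pd.Timestamp('2016-05-01 10:23:45'),
    pd.Timestamp('2016-05-01 18:56:34'),
    pd.Timestamp('2016-05-01 23:56:37'),
    pd.Timestamp('2016-05-02 03:54:24'),
    pd.Timestamp('2016-05-02 14:32:45'),
    pd.Timestamp('2016-05-02 15:38:55'),
]
df = pd.DataFrame({'values': list(range(len(stamps)))}, index=stamps)

# Sort first so that "last row of the group" really is the latest timestamp.
df = df.sort_index()

# Group rows by calendar day and keep the final row of each group.
last_rows = df.groupby(df.index.normalize()).tail(1)
print(last_rows)
```

The result keeps the original timestamps as its index, in chronological order.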

Related

Add/Subtract UTC Time to Datetime 'Time' column

I have a sample dataframe as given below.
import pandas as pd
import numpy as np
data = {'InsertedDate':['2022-01-21 20:13:19.000000', '2022-01-21 20:20:24.000000', '2022-02-02 16:01:49.000000', '2022-02-09 15:01:31.000000'],
'UTCOffset': ['-05:00','+02:00','-04:00','+06:00']}
df = pd.DataFrame(data)
df['InsertedDate'] = pd.to_datetime(df['InsertedDate'])
df
The 'InsertedDate' column is a datetime column, whereas 'UTCOffset' is a string column.
I want to add the offset to the 'InsertedDate' column and store the result in a new datetime column.
It should look something like this image shown below.
Any help is greatly appreciated. Thank you!
You can convert the offset with pd.to_timedelta and add it to the datetime column.
# to_timedelta needs to have [+-]HH:MM:SS format, so adding :00 to fill :SS part.
df['UTCOffset'] = pd.to_timedelta(df.UTCOffset + ':00')
df['CorrectTime'] = df.InsertedDate + df.UTCOffset
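For reference, here is the whole thing end to end on the sample data. It is only a sketch: unlike the snippet above, it keeps the original string column untouched and holds the converted offsets in a separate variable (offset is a name I made up).

```python
import pandas as pd

df = pd.DataFrame({
    'InsertedDate': pd.to_datetime([
        '2022-01-21 20:13:19', '2022-01-21 20:20:24',
        '2022-02-02 16:01:49', '2022-02-09 15:01:31',
    ]),
    'UTCOffset': ['-05:00', '+02:00', '-04:00', '+06:00'],
})

# Append ':00' so each offset reads as [+-]HH:MM:SS, then convert to Timedelta.
offset = pd.to_timedelta(df['UTCOffset'] + ':00')

# Adding a Timedelta column to a datetime column works element-wise.
df['CorrectTime'] = df['InsertedDate'] + offset
print(df)
```

For the first row, 20:13:19 plus an offset of -05:00 gives 15:13:19 on the same day.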

Converting dates to numbers on Python [duplicate]

I have one field in a pandas DataFrame that was imported as string format.
It should be a datetime variable. How do I convert it to a datetime column and then filter based on date.
Example:
df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000']})
Use the to_datetime function, specifying a format to match your data.
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
If you have more than one column to be converted you can do the following:
df[["col1", "col2", "col3"]] = df[["col1", "col2", "col3"]].apply(pd.to_datetime)
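The question also asks how to filter by date once the column is converted. A small sketch of that step follows; the two extra rows and the variable names in_2014 and cutoff are invented for illustration, and it assumes an English locale so that %b matches SEP, OCT and JAN.

```python
import pandas as pd

df = pd.DataFrame({'date': ['05SEP2014:00:00:00.000',
                            '17OCT2014:08:30:00.000',
                            '02JAN2015:12:00:00.000']})

# Parse with an explicit format that matches the raw strings.
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')

# Once the column is datetime64, ordinary comparisons build a boolean mask.
cutoff = pd.Timestamp('2015-01-01')
in_2014 = df[df['date'] < cutoff]
print(in_2014)
```

The same mask style works for ranges, for example combining two comparisons with &.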
You can use the DataFrame method .apply() to operate on the values in Mycol:
>>> df = pd.DataFrame(['05SEP2014:00:00:00.000'],columns=['Mycol'])
>>> df
Mycol
0 05SEP2014:00:00:00.000
>>> import datetime as dt
>>> df['Mycol'] = df['Mycol'].apply(lambda x:
dt.datetime.strptime(x,'%d%b%Y:%H:%M:%S.%f'))
>>> df
Mycol
0 2014-09-05
Use the pandas to_datetime function to parse the column as datetime. Passing infer_datetime_format=True makes pandas detect the format automatically and convert the column. (In pandas 2.0 and later this argument is deprecated, because a strict form of format inference is the default behavior.)
import pandas as pd
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], infer_datetime_format=True)
chrisb's answer works:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'], format='%d%b%Y:%H:%M:%S.%f')
however, it triggers a pandas SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I would guess this is due to chained indexing.
Time Saver:
raw_data['Mycol'] = pd.to_datetime(raw_data['Mycol'])
To silence SettingWithCopyWarning
If you got this warning, then that means your dataframe was probably created by filtering another dataframe. Make a copy of your dataframe before any assignment and you're good to go.
df = df.copy()
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f')
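To make the situation concrete, here is a sketch of how such a dataframe typically comes about and where the copy goes. The frame raw, the 'keep' column and the name subset are made up for the example.

```python
import pandas as pd

raw = pd.DataFrame({
    'date': ['05SEP2014:00:00:00.000', '06SEP2014:00:00:00.000', '07SEP2014:00:00:00.000'],
    'keep': [True, False, True],
})

# Filtering returns a new object derived from raw; calling .copy()
# makes it an independent dataframe, so later assignments are unambiguous.
subset = raw[raw['keep']].copy()
subset['date'] = pd.to_datetime(subset['date'], format='%d%b%Y:%H:%M:%S.%f')

print(subset)
print(raw['date'].iloc[0])  # raw still holds the original strings
```

Only subset is converted; raw is left exactly as it was.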
errors='coerce' is useful
If some rows are not in the correct format, or are not datetimes at all, the errors= parameter is very useful: with errors='coerce' the valid rows are converted, the invalid ones become NaT, and you can handle those later.
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
# for multiple columns
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')
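A short sketch of the "handle them later" part: after coercing, the rows that failed to parse can be pulled out by checking for NaT. The sample strings and the names raw, parsed and bad_rows are my own.

```python
import pandas as pd

raw = pd.Series(['05SEP2014:00:00:00.000', 'not a date', '2014-09-05'])

# Values that do not match the format become NaT instead of raising.
parsed = pd.to_datetime(raw, format='%d%b%Y:%H:%M:%S.%f', errors='coerce')

# Use the NaT positions to look up the original offending strings.
bad_rows = raw[parsed.isna()]
print(parsed)
print(bad_rows)
```

Here the third value is a perfectly valid date, but it does not match the given format, so it is coerced as well.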
Setting the correct format= is much faster than letting pandas find out [1]
Long story short, passing the correct format= from the beginning as in chrisb's post is much faster than letting pandas figure out the format, especially if the format contains a time component. The runtime difference for dataframes greater than 10k rows is huge (~25 times faster, so we're talking like a couple minutes vs a few seconds). All valid format options can be found at https://strftime.org/.
[1] Code used to produce the timeit test plot.
import pandas as pd
import perfplot
from random import choices
from datetime import datetime

# value ranges for month, day, year, hour, minute, second, fractional second
mdYHMSf = range(1,13), range(1,29), range(2000,2024), range(24), *[range(60)]*2, range(1000)
perfplot.show(
    kernels=[lambda x: pd.to_datetime(x),
             lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M:%S.%f'),
             lambda x: pd.to_datetime(x, infer_datetime_format=True),
             lambda s: s.apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))],
    labels=["pd.to_datetime(df['date'])",
            "pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M:%S.%f')",
            "pd.to_datetime(df['date'], infer_datetime_format=True)",
            "df['date'].apply(lambda x: datetime.strptime(x, '%m/%d/%Y %H:%M:%S.%f'))"],
    n_range=[2**k for k in range(20)],
    # build a Series of n random datetime strings
    setup=lambda n: pd.Series([f"{m}/{d}/{Y} {H}:{M}:{S}.{f}"
                               for m,d,Y,H,M,S,f in zip(*[choices(e, k=n) for e in mdYHMSf])]),
    equality_check=pd.Series.equals,
    xlabel='len(df)'
)
Just as we convert an object column to float or int, use astype():
raw_data['Mycol']=raw_data['Mycol'].astype('datetime64[ns]')

Group pandas elements according to a column

I have the following pandas dataframe:
import pandas as pd
data = {'Sentences':['Sentence1', 'Sentence2', 'Sentence3', 'Sentences4', 'Sentences5', 'Sentences6','Sentences7', 'Sentences8'],'Time':[1,0,0,1,0,0,1,0]}
df = pd.DataFrame(data)
print(df)
I was wondering how to extract all the "Sentences" according to the "Time" column. I want to gather all the "sentences" from the first "1" to the last "0".
Maybe the expected output explains it better:
[[Sentences1,Sentences2,Sentences3],[Sentences4,Sentences5,Sentences6],[Sentences7,Sentences8]]
Is this somehow possible? Sorry, I am very new to pandas.
Try this:
s = df['Time'].cumsum()
df.set_index([s, df.groupby(s).cumcount()])['Sentences'].unstack().to_numpy().tolist()
Output:
[['Sentence1', 'Sentence2', 'Sentence3'],
['Sentences4', 'Sentences5', 'Sentences6'],
['Sentences7', 'Sentences8', nan]]
Details:
Use cumsum on Time so that each 1 starts a new group that also contains the 0 rows following it.
Next, use groupby with cumcount to number the rows within each group.
Lastly, use set_index and unstack to reshape the dataframe.
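If the trailing nan in the last group is unwanted, the same cumsum key can be used to collect ragged lists directly. A minimal sketch on the question's data (group_id and groups are names I picked):

```python
import pandas as pd

data = {'Sentences': ['Sentence1', 'Sentence2', 'Sentence3', 'Sentences4',
                      'Sentences5', 'Sentences6', 'Sentences7', 'Sentences8'],
        'Time': [1, 0, 0, 1, 0, 0, 1, 0]}
df = pd.DataFrame(data)

# The running sum increases at every 1, giving group ids 1,1,1,2,2,2,3,3.
group_id = df['Time'].cumsum()

# Collect the sentences of each group into a plain Python list.
groups = df.groupby(group_id)['Sentences'].apply(list).tolist()
print(groups)
```

Because each group becomes its own list, groups of different lengths need no padding.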

Loop through a list of tickers with separate outputs

I have a list of tickers of which I would like to output individual datasets with financial information from pandas datareader.
I have tried to create a simple loop that takes a list of tickers and inputs it into the pandas datareader function.
import pandas as pd
import pandas_datareader as pdr
myTickers = ['AAPL', 'PG']
for ticks in myTickers:
    print(ticks)
    ticks = pdr.DataReader(ticks, 'yahoo', start='2019-01-01', end='2019-01-08')['Adj Close']
The problem here seems to be that the loop only substitutes in the myTickers values inside the DataReader function but it does not change the name of the dataframe from "ticks" to e.g. AAPL. Thereby all results will be overridden with whatever ticker loops last.
What do I need to modify in order for this loop to output two different dataframes with the names in the ticker list?
You can store each ticker's series as a column of a single DataFrame, then use a small function to get any column back as its own DataFrame.
import pandas as pd
import pandas_datareader as pdr
myTickers = ['AAPL', 'PG']
df=pd.DataFrame(columns=myTickers)
for ticks in myTickers:
    df[ticks] = pdr.DataReader(ticks, 'yahoo', start='2019-01-01', end='2019-01-08')['Adj Close']
def ticks(s):
    return df[s].to_frame()
ticks('AAPL')
Output:
AAPL
Date
2019-01-02 156.049484
2019-01-03 140.505798
2019-01-04 146.503891
2019-01-07 146.177811
2019-01-08 148.964386
ticks('PG')
Output:
PG
Date
2019-01-02 89.350105
2019-01-03 88.723633
2019-01-04 90.534523
2019-01-07 90.172348
2019-01-08 90.505157
As you pointed out, the loop variable is overwritten on each iteration, so the result needs to be stored somewhere. You could replace each entry in myTickers with its corresponding DataFrame. However, keeping a reference to the ticker will be useful. Perhaps the following may help.
tickers_df_dict = {
    ticks: pdr.DataReader(ticks, 'yahoo', start='2019-01-01', end='2019-01-08')['Adj Close']
    for ticks in myTickers
}
That being said, as far as I'm aware, using the yahoo API will result in the following error. You may need to revise your chosen data source.
ImmediateDeprecationError:
Yahoo Daily has been immediately deprecated due to large breaks in the API without the
introduction of a stable replacement. Pull Requests to re-enable these data
connectors are welcome.
See https://github.com/pydata/pandas-datareader/issues
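The dictionary pattern itself does not depend on the data source. Below is a sketch that runs without any network access; fetch_prices is a hypothetical stand-in I wrote for whatever reader you end up using, and the numbers it returns are placeholders, not real prices.

```python
import pandas as pd

def fetch_prices(ticker, start, end):
    """Hypothetical stand-in for a real data reader; returns placeholder values."""
    dates = pd.bdate_range(start, end)
    return pd.Series(range(len(dates)), index=dates, name=ticker, dtype='float64')

myTickers = ['AAPL', 'PG']

# One entry per ticker, looked up by name instead of by variable.
prices = {t: fetch_prices(t, '2019-01-01', '2019-01-08') for t in myTickers}

# The dict can be combined into one frame with a column per ticker.
combined = pd.concat(prices, axis=1)
print(prices['AAPL'])
print(combined)
```

Swapping fetch_prices for a real reader should leave the rest of the code unchanged, as long as it returns a Series per ticker.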

extract information from single cells from pandas dataframe

I'm looking to pull specific information from the table below to use in other functions. For example extracting the volume on 1/4/16 to see if the volume traded is > 1 million. Any thoughts on how to do this would be greatly appreciated.
import pandas as pd
import pandas.io.data as web # Package and modules for importing data; this code may change depending on pandas version
import datetime
# start date: January 1, 2016
start = datetime.datetime(2016,1,1)
end = datetime.date.today()
apple = web.DataReader("AAPL", "yahoo", start, end)
type(apple)
apple.head()
Results:
The DataReader returns a DataFrame with a DatetimeIndex, so you can use partial datetime string matching with loc to select a specific row and column:
apple.loc['2016-04-01','Volume']
To test whether this is larger than 1 million, just compare it:
apple.loc['2016-04-01','Volume'] > 1000000
which will return True or False
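Here is the same lookup on a small hand-made frame, so it can be tried without downloading anything. The prices and volumes are invented for the example, and the sketch uses an exact pd.Timestamp label rather than a date string.

```python
import pandas as pd

# Synthetic stand-in for the downloaded data; values are made up.
apple = pd.DataFrame(
    {'Close': [10.0, 11.0, 12.0],
     'Volume': [900_000, 1_500_000, 2_000_000]},
    index=pd.to_datetime(['2016-01-04', '2016-01-05', '2016-01-06']),
)

# Row label first, column label second: this returns a single cell.
volume = apple.loc[pd.Timestamp('2016-01-04'), 'Volume']

# Wrap in bool() to get a plain Python True/False for use in other functions.
is_heavy = bool(volume > 1_000_000)
print(volume, is_heavy)
```

The boolean can then be passed straight into an if statement or another function.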
