Pandas Create Range of Dates Without Weekends - python-3.x

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'c'],
                   'first_date': ['2015-08-31 00:00:00', '2015-08-24 00:00:00', '2015-08-25 00:00:00']})
df.first_date = pd.to_datetime(df.first_date)  # dtype='<M8[ns]'
df['last_date'] = pd.to_datetime('5/6/2016')   # dtype='datetime64[ns]'
df
   A first_date  last_date
0  a 2015-08-31 2016-05-06
1  b 2015-08-24 2016-05-06
2  c 2015-08-25 2016-05-06
I'd like to create a new column which contains the list (or array) of dates between 'first_date' and 'last_date' which excludes weekends.
So far, I've tried this:
pd.date_range(df['first_date'],df['last_date'])
...but this error occurs:
TypeError: Cannot convert input to Timestamp
I also tried this before pd.date_range...
pd.Timestamp(df['first_date'])
...but no dice.
Thanks in advance!
P.S.:
After this hurdle, I'm going to look at other lists of dates and, if they fall within the generated array (per row in 'A'), subtract them out of the list (or array). I'll post that as a separate question.

freq='B' gives you business days, or no weekends.
Your error:
TypeError: Cannot convert input to Timestamp
is the result of passing a Series to the pd.date_range function when it expects a Timestamp.
Instead, use apply.
However, I still find it tricky to get lists into specific cells of a dataframe. The trick I use is pd.Series([mylist]); notice it is a list containing the list. If it were just pd.Series(mylist), pandas would convert the list into a series, and you'd get a series of series, which is a dataframe.
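For example, a quick illustration of the difference:
mylist = [1, 2, 3]
pd.Series(mylist)    # three rows, one scalar value per row
pd.Series([mylist])  # one row whose single value is the whole list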
try:
def fnl(x):
    l = pd.date_range(x.loc['first_date'], x.loc['last_date'], freq='B')
    return pd.Series([l])

df['range'] = df.apply(fnl, axis=1)
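For what it's worth, a plain list comprehension sidesteps the nested-Series trick entirely (a sketch, reusing the df built above):
df['range'] = [pd.date_range(start, end, freq='B')
               for start, end in zip(df['first_date'], df['last_date'])]
Each cell then holds a DatetimeIndex of business days for that row.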

Related

Widening long table grouped on date

I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but can't seem to make it work. I have tried to transpose it and pivot it, but can't really get it the way I want it.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point for each hour from 00:00 to 20:00 for each day. So I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
Assuming from the details you have provided, I think you are dealing with time-series data and you have data from different dates acquired at 02:00:00 and 03:00:00. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we find the unique days on which you acquired data:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from your date string because the nine-digit nanosecond part would complicate parsing (%f handles at most six digits). We convert each string to a datetime object, take its date, and keep the unique values.
Now we create a new empty df in desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
after this we can populate the values:
for day in unique_days:
    new_row_data = [day]  # start a row of 3 elements to insert into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # value at 02:00 for that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # value at 03:00 the same day
    new_df.loc[len(new_df)] = new_row_data  # append the row at the last position
this should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
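As a side note, since you mentioned pivot: a pivot-based version avoids the loop entirely (a sketch, assuming the same df as above):
df["date"] = pd.to_datetime(df["date"])
new_df = (df.assign(day=df["date"].dt.date, time=df["date"].dt.strftime("%H:%M"))
            .pivot(index="day", columns="time", values="value")
            .reset_index())
This also scales to the full 00:00-20:00 range without generating the columns by hand.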

Converting simple returns to monthly log returns

I have a pandas DataFrame with simple daily returns. I need to convert it to monthly log returns and add a column to the current DataFrame. I have to use np.log to compute the monthly return. But I can only compute daily log return. Below is my code.
df['return_monthly'] = np.log(data['Simple Daily Returns'] + 1)
The code only produces daily log returns. Are there any particular methods I should be using in the above code to get monthly returns?
Please see my input pandas DataFrame; the third column in the Excel screenshot is the expected output.
The question is a little confusing, but it seems like you want to group the rows by month. This can be done with pandas.resample if you have a datetime index, pandas.groupby, or pandas.pivot.
Here is a simple implementation; let us know if this isn't what you're looking for. Note that your values are less than 1, so the log is negative; you can adjust as needed. I aggregated the months with sum, but there are many other aggregation functions, such as mean(), median(), and size(). See the pandas documentation for a full list of aggregation functions.
import numpy as np
import pandas as pd
# create a dataframe with 1220 values that match your dataset
df = pd.DataFrame({
    'Date': pd.date_range(start='1/1/2019', end='5/4/2022', freq='1D'),
    'Return': np.random.uniform(low=1e-6, high=1.0, size=1220)  # avoid log 0, which returns NaN
}).set_index('Date')  # set the index to the date so we can use resample
df['Log_return'] = np.log(df['Return'])  # daily log return
df = df.resample('M').sum()  # aggregate each month with sum
Return Log_return
Date
2019-01-31 14.604863 -33.950987
2019-02-28 13.118111 -32.025086
2019-03-31 14.541947 -32.962914
2019-04-30 14.212689 -33.684422
2019-05-31 14.154918 -33.347081
2019-06-30 10.710209 -43.474120
2019-07-31 12.358001 -43.051723
2019-08-31 17.932673 -30.328784
...
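Equivalently, the pandas.groupby route mentioned above produces the same monthly aggregation (a sketch, applied to the daily df before the resample step):
monthly = df.groupby(df.index.to_period('M')).sum()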

Date stuck as unformattable in pandas dataframe

I am trying to plot time series data, but my date column is stuck in an odd state. I cannot figure out what datatype it is in order to change it, and adding verbose=True doesn't yield any explanation for the data.
Here is a screenshot of the output.
Here is the code I have for importing the dataset and all the formatting I've done to it. Ultimately, I want date and time to be separate values, but I cannot figure out why pandas is auto-formatting them and applying today's date.
df = pd.read_csv('dataframe.csv')
df['Avg_value'] = (df['Open'] + df['High'] + df['Low'] + df['Close'])/4
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_datetime(df['Time'])
Any help would be appreciated
The output I'd be looking for would be something like:
Date: 2016-01-04 10:00:00
As one column rather than 2 separate ones.
When you pass a Pandas Series into pd.to_datetime(...), it parses the values and returns a new Series of dtype datetime64[ns] containing the parsed values:
>>> pd.to_datetime(pd.Series(["12:30:00"]))
0 2021-08-24 12:30:00
dtype: datetime64[ns]
The reason you see today's date is that a datetime64 value must have both date and time information. Since the date is missing, the current date is substituted by default.
Pandas does not really have a dtype that supports "just time of day, no date" (there is one that supports time intervals, such as "5 minutes"). The Python standard library does, however: the datetime.time class.
Pandas provides a convenience function called the .dt accessor for extracting a Series of datetime.time objects from a Series of datetime64 objects:
>>> pd.to_datetime(pd.Series(["12:30:00"])).dt.time
0 12:30:00
dtype: object
Note that the dtype is just object, which is the fallback dtype Pandas uses for anything not natively supported by Pandas or NumPy. This means that working with these datetime.time objects will be a bit slower than working with a native Pandas dtype, but this probably won't be a bottleneck for your application.
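Since your desired output is actually a single combined column, one option (a sketch, assuming 'Date' and 'Time' come out of the CSV as strings) is to concatenate the two columns before parsing:
df = pd.read_csv('dataframe.csv')
df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
This yields one datetime64 column like 2016-01-04 10:00:00 instead of two partial ones.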
Recommended reference: https://pandas.pydata.org/docs/user_guide/timeseries.html

Pandas: Ignore specific (bad) cells while doing sum(), mean() operations

I want to execute sum and mean operations on the 'number' column using the pandas library in Python, but some cells contain wrong data (2020-05-30) or are empty. How can I ignore those cells?
number
25
1
12
2020-05-30
6
7
...
Thank you.
Convert the wrong values to missing values (NaN); by default, pandas omits them in sum and mean:
df['number'] = pd.to_numeric(df.number, errors='coerce')
Or additionally remove rows with missing values using DataFrame.dropna:
df['number'] = pd.to_numeric(df.number, errors='coerce')
df = df.dropna(subset=['number'])
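A quick check with the sample values from the question (a minimal sketch):
import pandas as pd
df = pd.DataFrame({'number': ['25', '1', '12', '2020-05-30', '6', '7']})
df['number'] = pd.to_numeric(df['number'], errors='coerce')  # the bad cell becomes NaN
print(df['number'].sum())   # 51.0, the NaN is skipped
print(df['number'].mean())  # 10.2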

P-value normal test for multiple rows

I got the following simple code to calculate normality over an array:
import pandas as pd
df = pd.read_excel("directory\file.xlsx")
import numpy as np
x=df.iloc[:,1:].values.flatten()
import scipy.stats as stats
from scipy.stats import normaltest
stats.normaltest(x,axis=None)
This gives me nicely a p-value and a statistic.
The only thing I want right now is to:
Add two columns to the file with this p-value and statistic, and if I have multiple rows, do it for all of them (calculate the p-value and statistic for each row and add two columns containing these values).
Can someone help?
If you want to calculate normaltest row-wise, you should not flatten your data into x; instead, use axis=1, such as:
import numpy as np
import pandas as pd
from scipy import stats
df = pd.DataFrame(np.random.random(105).reshape(5, 21))  # generate sample data
# calculate normaltest row-wise, skipping the first column as you did
df['stat'], df['p'] = stats.normaltest(df.iloc[:, 1:], axis=1)
Then df contains two columns, 'stat' and 'p', with the values you are looking for, IIUC.
Note: to be able to perform normaltest, you need at least 8 values (based on my experience), so you need at least 8 columns in df.iloc[:,1:]; otherwise it will raise an error. Even then, it is better to have more than 20 values in each row.
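For instance, a small check of that caveat (reusing the imports above; the shapes are made up for illustration):
stats.normaltest(np.random.random((5, 21)), axis=1)   # fine: 21 values per row
# stats.normaltest(np.random.random((5, 5)), axis=1)  # ValueError: not valid with less than 8 samples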
