I have a dataframe called prices, with historical stock prices for the following companies:
['APPLE', 'AMAZON', 'GOOGLE']
So far, with the help of a friendly user, I was able to create a dataframe for each of these periods with the following code:
import pandas as pd
import numpy as np
from datetime import datetime, date
prices = pd.read_excel('database.xlsx')
companies=prices.columns
companies=list(companies)
del companies[0]
timestep = 250
prices_list = [prices[day:day + timestep] for day in range(len(prices) - timestep)]
Now, I need to evaluate the change in price over every period of 251 days (Price251/Price1; Price252/Price2; Price253/Price3; and so on) for each one of the companies, and create a column for each one of them.
I would also like to make the column names dynamic, so I can replicate this on a much larger database.
So, I would get a dataframe similar to the one shown in the linked image, with an extra change column for each company.
Here you can find the dataframe head(3): Initial Dataframe
IIUC, try this:
def create_cols(df, num_dates):
    # add a '<company>%' column with the price change over num_dates rows
    for col in list(df)[1:]:
        df['{}%'.format(col)] = -((df[col].shift(num_dates) - df[col]) / df[col].shift(num_dates)).shift(-num_dates)
    return df
create_cols(prices,251)
You would only have to format the new columns as percentages.
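For readability, the same computation can also be written with pct_change, and the new columns can then be displayed as percentages via the pandas Styler. This is my rewording, not the answer's original code, and it assumes the first column holds the dates as in the code above:
def create_cols(df, num_dates):
    # one '<company>%' column per company: change over the next num_dates rows
    for col in list(df)[1:]:
        df['{}%'.format(col)] = df[col].pct_change(periods=num_dates).shift(-num_dates)
    return df

create_cols(prices, 251)

# show the new columns as percentages (display only)
pct_cols = [c for c in prices.columns if c.endswith('%')]
prices.style.format({c: '{:.2%}' for c in pct_cols})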
I have an Excel .xlsb sheet with data; some columns contain numbers, while other columns should contain dates. After loading the data in Python, some of those date columns show a number instead of a date. How can I convert the numbers in such a column to dates?
I use Pandas and ddf
The date-of-birth column ('dob_l1') shows '12150', which should be the date '6-4-1933'.
I tried to solve this, but unfortunately I only managed to get the date '2050-01-12' which is incorrect.
I used: ddf['nwdob_l1'] = pd.to_datetime(ddf['dob_l1'], format='%d%m%y', errors='coerce')
Who can help me? I was happy to receive some good feedback from joe90, who showed me a function that works for single dates:
import datetime

def xldate2date(xl):
    # valid for dates from 1900-03-01
    basedate = datetime.date(1899, 12, 30)
    d = basedate + datetime.timedelta(days=xl)
    return d
# Example:
# >>> print(xldate2date(44948))
# 2023-01-22
That is correct; however, I need to convert all values in the column (more than 500,000 of them), so I cannot do it one by one.
As that question is closed, I hereby open a new question.
Is there anyone who can help me to find the correct code to get the right date in the whole column?
When you read the data in with pandas, there are tools for handling dates. You want to use parse_dates.
Documentation for read_excel
example:
import pandas as pd
df = pd.read_excel('file/path/the.xlsx', parse_dates=['Date'])
This will read the column as datetime64, which is better than a number.
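If the column was already read in as plain Excel serial numbers (as in the question), the whole column can also be converted in one vectorized call by pointing to_datetime at the same base date joe90's function uses. A sketch, assuming the values really are Excel day counts:
import pandas as pd

# interpret the values as day counts from Excel's epoch (1899-12-30)
ddf['nwdob_l1'] = pd.to_datetime(ddf['dob_l1'], unit='D', origin='1899-12-30', errors='coerce')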
I'm using Python 3.9 with Pandas and Numpy.
Every day I receive a df with orders from the company I work for. Each day, this df comes from a different country whose language I don't know, and these dataframes don't follow a single layout. In particular, I don't know the column names or the index.
I just know that the order IDs follow a pattern: 3 digits + 2 letters, like 000AA, 149KL, 555EE, etc.
I saw that this is possible with plain strings, but with pandas I have only found commands that need the name of the column.
df.column_name.str.contains(pat=r'\d\d\d\w\w', regex=True)
If I can find the column whose values all match this pattern, I know which column holds the orders.
I started with a synthetic data set
import pandas
df = pandas.DataFrame([{'a': 3, 'b': 4, 'c': '222BB', 'd': '2asf'},
                       {'a': 2, 'b': 1, 'c': '111AA', 'd': '942'}])
I then cycle through each column. If the datatype is object, I test whether all the elements in the Series match the regex:
for column_id in df.columns:
    if df[column_id].dtype == 'object':
        if all(df[column_id].str.contains(pat=r'\d\d\d\w\w', regex=True)):
            print("matching column:", column_id)
I have the following dataframe:
I need to resample the data to calculate the weekly pct_change(). How can I get the weekly change?
Something like data['pct_week'] = data['Adj Close'].resample('W').ffill().pct_change(), but the data needs to be grouped with data.groupby(['month', 'week']).
This way every month would yield 4 values for the weekly change, which I can then graph.
What I did was df['pct_week'] = data['Adj Close'].groupby(['week', 'day']).pct_change(), but I got this error: TypeError: 'type' object does not support item assignment
If you want grouping with resample, a plain DatetimeIndex is required first, so move all the index levels except the first back into columns with
DataFrame.reset_index, then group and resample with a custom function, because pct_change is not implemented for resample:
def percent_change(x):
    return pd.Series(x).pct_change()
Another idea is to use a numpy-based implementation of pct_change:
def percent_change(x):
    return x / np.concatenate(([np.nan], x[:-1])) - 1
df1 = (df.reset_index(level=[1, 2, 3])
         .groupby(['month', 'week'])['Adj Close']
         .resample('W')
         .apply(percent_change))
As for "this way every month would yield 4 values for weekly change": it seems no groupby is needed then, only a downsample such as sum chained with Series.pct_change:
df2 = (df.reset_index(level=[1, 2, 3])
         .resample('W')['Adj Close']
         .sum()
         .pct_change())
Drop the unwanted index levels. The datetime index is enough for resampling / grouping:
df.index = df.index.droplevel(['month', 'week', 'day'])
Resample by week, select the column needed, add an aggregation function, and then calculate the percentage change:
df.resample('W')['Adj Close'].mean().pct_change()
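For prices, the last observation of each week may be a more natural aggregate than the mean; this variation is my suggestion, not part of the original answer:
# week-over-week change based on each week's last available price
df.resample('W')['Adj Close'].last().pct_change()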
I have a df with 3 timestamp columns:
X                         ...
01/01/2013 12:00:20 AM    ...
I have been trying to convert these columns to datetime format for some further analysis.
When I run:
df.dtypes
the output shows each of these columns as object. I have been reading the data in from a CSV, so they should be string objects.
When converting them to DateTime I have been using:
df['X'] = pd.to_datetime(df['X'])
and
df['X'] = df['X'].astype('datetime64[ns]')
But in every case, the kernel just keeps running and I am not getting anywhere... I want to be able to use these dates and times to calculate the difference between timestamp columns in minutes and such.
Any help would be greatly appreciated. Thank You.
Here is a full example that works for me. You can try it out in your own setup:
import pandas as pd
df=pd.DataFrame([["1/1/2016 12:00:20 AM","3/1/2016"],
["6/15/2016 4:00:20 AM","7/14/2016"],
["7/14/2016 11:00:20 AM","8/15/2016"],
["8/7/2016 00:00:20 AM","9/6/2016"]]
,columns=['X','Y'])
print(df)
#convert one column
df['X'] = pd.to_datetime(df['X'])
print(df)
#convert all columns
df[df.columns] = df[df.columns].apply(pd.to_datetime)
print(df)
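If to_datetime seems to hang on a large frame, passing an explicit format usually speeds it up considerably, because pandas no longer has to infer the format for every row. A sketch, assuming all values follow the layout shown in the question:
# an explicit format avoids per-row format inference; unparseable values become NaT
df['X'] = pd.to_datetime(df['X'], format='%m/%d/%Y %I:%M:%S %p', errors='coerce')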
We have measurements for n points (say 22 points) over a period of time, stored in a real-time store. We are now looking for some understanding of the trends for those points. To that end we read the measurements into a pandas DataFrame (Python). Within this DataFrame the points are columns and the rows are the respective measurement times.
We would like to extend the DataFrame with new columns for the mean and std, inserting 'mean' and 'std' columns for each existing column, each column being a particular measurement. This means two new columns for each of the 22 measurement points.
Now the question is whether this is best achieved by adding the new mean and std columns while iterating over the existing columns, or whether there is a more effective built-in DataFrame operation or trick.
Our understanding is that updating the DataFrame in a for loop would be by far the worst practice.
Thanks for any comment or proposal.
From the comments, I guess this is what you are looking for -
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.normal(size = (1000,22))) # created an example dataframe
measurement_cols = df.columns  # remember the original measurement columns before adding new ones
df.loc[:, 'means'] = df[measurement_cols].mean(axis=1)
df.loc[:, 'std'] = df[measurement_cols].std(axis=1)  # std, not mean, and without the new 'means' column
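If the intent was instead one mean and one std column next to each measurement column, as the wording "for each existing column" suggests, a small sketch could look like the following; the column names are my own choice:
for col in measurement_cols:
    # constant columns repeating each measurement point's overall mean / std
    df[f'{col}_mean'] = df[col].mean()
    df[f'{col}_std'] = df[col].std()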