I have sales data from January 2014 until last week, and the data refreshes every day.
I want to generate some insights automatically that compare against the latest week, for example how much sales decreased or increased from last week to this week, which product is the hot product, etc.
I am confused about how to determine and store the latest week dynamically.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Product': ['EyeWear', 'Packs', 'Watches', 'Irons', 'Glasses'],
    'Country': ['USA', 'India', 'Africa', 'UK', 'India'],
    'Revenue': [98, 90, 87, 69, 78],
    'Date': ['20140101', '20140102', '20140103', '20140104', '20140105']},
    index=[1, 2, 3, 4, 5])
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number
df['week'] = df['Date'].dt.isocalendar().week
df['YearMonth'] = df['Date'].dt.strftime('%Y%m')
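One possible way to keep the "latest week" dynamic is to derive it from the data on every refresh instead of storing it. The sketch below is only an illustration under assumptions: the sample rows are made up, weeks run Monday to Sunday (the pandas default for `to_period('W')`), and the names `latest_week`, `previous_week`, `change` and `hot_product` are hypothetical.

```python
import pandas as pd

# Made-up sample spanning two calendar weeks (assumption, not the real sales data)
df = pd.DataFrame({
    "Product": ["EyeWear", "Packs", "Watches", "Irons", "Glasses", "Packs"],
    "Revenue": [98, 90, 87, 69, 78, 120],
    "Date": pd.to_datetime(["20140101", "20140102", "20140103",
                            "20140106", "20140107", "20140108"]),
})

# Label every row with its weekly period (Monday to Sunday)
df["week"] = df["Date"].dt.to_period("W")

# The latest week is whatever the newest row falls into, so it moves with each refresh
latest_week = df["week"].max()
previous_week = latest_week - 1

# Revenue per week; .get returns 0 if a week has no rows at all
weekly = df.groupby("week")["Revenue"].sum()
change = weekly.get(latest_week, 0) - weekly.get(previous_week, 0)

# Product with the highest revenue inside the latest week
hot_product = (df[df["week"] == latest_week]
               .groupby("Product")["Revenue"].sum()
               .idxmax())

print(latest_week, change, hot_product)
```

Because the week is a `Period` rather than a bare week number, the comparison keeps working across year boundaries, where week 52 is followed by week 1.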
Related
The following code generates a small dataframe that is intended to be a fictitious Olympics medal table.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0, 47, 20).reshape(4, 5),
                  index=['USA', 'USR', 'ITL', 'GBR'],
                  columns=[1996, 2000, 2004, 2008, 2012])
# Highest medal tally per country (row)
df['Highest'] = df.max(axis=1).round()
df = df.sort_values('Highest', ascending=False).head(10)
df
I have added a column at the end to establish the highest medal tally per row (Country).
I need to add an additional 'Year' column that adds the year in which the highest medal tally was won for each row.
So, if the highest number of medals on row 1 was won in the year 2012, the value of 2012 should be added in row 1 of the new 'Year' column.
How can I do that?
Thanks
Here's one option: find the index location of the maximum, then look up the year. You can adapt it for your purpose as needed. Create a random df first.
Using .index on the filtered frame gives an Index object; in this case it holds a single element (the row with the max), so use [0] to get the label out of it.
Then use .at to get the year at the max value.
df = pd.DataFrame(data={'Year': range(2000, 2010), 'Value': np.random.uniform(low=0.5, high=13.3, size=(10,))}, columns=['Year', 'Value'])
max_value = df.Value.max()
idx_max_value = df.loc[df.Value == max_value].index[0]
year_at_max_value = df.at[idx_max_value,'Year']
Probably not the most Pythonic solution, but this works:
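year = []
for x in range(len(df)):
    # Medal counts of row x across the five year columns
    pip = np.array(df.iloc[x, :5])
    # Position of the largest count, mapped back to its column label
    i = np.argmax(pip)
    year.append(df.columns[i])
df['Year'] = year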
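As an alternative to the loop, `DataFrame.idxmax(axis=1)` returns the column label of each row's maximum directly. This is a minimal sketch with fixed, made-up medal counts instead of random ones; `year_cols` is a hypothetical helper name.

```python
import pandas as pd

# Fixed, made-up medal counts so the result is predictable
df = pd.DataFrame(
    [[10, 20, 5, 7, 9],
     [3, 4, 44, 2, 1]],
    index=["USA", "GBR"],
    columns=[1996, 2000, 2004, 2008, 2012],
)

# Remember the year columns before adding any summary columns
year_cols = list(df.columns)

df["Highest"] = df[year_cols].max(axis=1)   # highest tally per country
df["Year"] = df[year_cols].idxmax(axis=1)   # year in which that tally occurred

print(df)
```

If a row has ties, `idxmax` reports the first year in which the maximum occurs.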
I am trying to test stock algorithms using historical data. I want to be able to select a date range, or even just a single date, but I keep getting empty dataframes. What am I doing wrong?
All I want it to do is select that day's market data.
Here is the relevant code:
# Imports assumed here; they were not shown in the original post
import datetime
import pandas_datareader as pdr

def getdata(symbol, end_date, days):
    start_date = end_date - datetime.timedelta(days=days)
    return pdr.get_data_yahoo(symbol, start=start_date, end=end_date)

today = datetime.date.today()
date = today - datetime.timedelta(days=3)
df = getdata("MLM", today, 375)
print(date)
df2 = df.loc[df.index == date]
print(df2)
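Two things may explain the empty result, though this is a guess since the post does not show the index: the index holds pandas Timestamps while `date` is a plain `datetime.date`, so the equality test may not match, and the day three days back can fall on a weekend or holiday with no market data. The sketch below avoids any download by using a made-up price table; `prices`, `day`, `one_day`, `date_range` and `weekend` are hypothetical names.

```python
import datetime
import pandas as pd

# Made-up stand-in for the downloaded data: business days only, like market data
idx = pd.bdate_range("2024-01-01", periods=10)
prices = pd.DataFrame({"Close": range(100, 110)}, index=idx)

day = datetime.date(2024, 1, 3)  # a Wednesday

# Convert the date to a Timestamp before comparing it with the DatetimeIndex
one_day = prices.loc[prices.index == pd.Timestamp(day)]

# A date range can be selected with label slicing (both ends inclusive)
date_range = prices.loc["2024-01-02":"2024-01-05"]

# A Saturday has no rows, so this selection is legitimately empty
weekend = prices.loc[prices.index == pd.Timestamp("2024-01-06")]

print(one_day)
print(date_range)
print(weekend.empty)
```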
When reading a CSV file, the date column holds a month name (Jul-20 for July 2020), and with parse_dates=True pandas converts it to 01-07-2020. How can I force pandas to convert it to the end of the month (i.e., 31-07-2020)?
Thanks
Try using MonthEnd from pandas.tseries.offsets:
from pandas.tseries.offsets import MonthEnd
import pandas as pd
print(df)
       month
0 2020-07-01
1 2020-08-02
df['month_end'] = df['month'] + MonthEnd(1)
print(df)
       month  month_end
0 2020-07-01 2020-07-31
1 2020-08-02 2020-08-31
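One detail worth checking: `MonthEnd(1)` moves a date that already sits on a month end to the end of the following month, while `MonthEnd(0)` leaves it where it is. A small sketch, assuming the raw strings look like `Jul-20` and that month abbreviations are parsed in an English locale; the variable names are hypothetical.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

# Raw values as they might appear in the CSV (assumed format: abbreviated month, 2-digit year)
raw = pd.Series(["Jul-20", "Feb-21"])

# Parsing yields the first day of each month
month_start = pd.to_datetime(raw, format="%b-%y")

# MonthEnd(0) rolls forward to the month end, or stays put if already there
month_end = month_start + MonthEnd(0)

# Contrast of the two offsets on a date that is already a month end
already_end = pd.Timestamp("2020-07-31")
stays = already_end + MonthEnd(0)
jumps = already_end + MonthEnd(1)

print(month_end)
print(stays, jumps)
```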
You can use the built-in calendar and datetime modules and write your own apply function to achieve the desired result.
import calendar
import datetime
import pandas as pd

def parse_my_date(date):
    # '%b-%y' matches values such as 'Jul-20'; the result is the first day of that month
    date = datetime.datetime.strptime(date, '%b-%y')
    # monthrange returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(date.year, date.month)[1]
    # Move from the 1st to the last day of the month
    date += datetime.timedelta(days=last_day - 1)
    return date

df['date'] = df['date'].apply(parse_my_date)
I have a dataframe called prices, with historical stocks prices for the following companies:
['APPLE', 'AMAZON', 'GOOGLE']
So far, with the help of a friendly user, I was able to create a dataframe for each of these periods with the following code:
import pandas as pd
import numpy as np
from datetime import datetime, date
prices = pd.read_excel('database.xlsx')
companies=prices.columns
companies=list(companies)
del companies[0]
step = 250
# One slice of `step` consecutive rows for every possible starting day
prices_list = [prices[day:day + step] for day in range(len(prices) - step)]
Now, I need to evaluate the change in price for every period of 251 days (Price251/Price1; Price252/Price2; Price253/Price3 and so on) for each one of the companies, and create a column for each one of them.
I would also like to make the column names dynamic, so I can replicate this on a much larger database.
So, I would get a dataframe similar to this:
[image of the desired dataframe]
Here you can find the dataframe head(3): Initial Dataframe
IIUC, try this:
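def create_cols(df, num_dates):
    # Skip the first column (the dates) and add one '<company>%' column per company
    for col in list(df)[1:]:
        # Relative change between each row and the row num_dates rows later
        df['{}%'.format(col)] = df[col].shift(-num_dates) / df[col] - 1
    return df

# Price251/Price1 compares rows that are 250 positions apart
create_cols(prices, 250)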
You would then only have to format the new columns as percentages.
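To make the shift logic and the percentage formatting concrete, here is a small sketch on made-up prices with a window of 2 rows instead of 250. The data, the window size `n` and the `APPLE_fmt` column are assumptions for illustration only.

```python
import pandas as pd

# Made-up prices growing 10% per row, so the expected changes are easy to check
prices = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=5),
    "APPLE": [100.0, 110.0, 121.0, 133.1, 146.41],
})

n = 2  # rows between the two prices being compared (250 in the question)

# Price n rows ahead divided by the current price, minus 1
prices["APPLE%"] = prices["APPLE"].shift(-n) / prices["APPLE"] - 1

# Format as percentage strings, leaving the trailing NaN rows untouched
prices["APPLE_fmt"] = prices["APPLE%"].map("{:.2%}".format, na_action="ignore")

print(prices)
```

The last `n` rows have no later price to compare against, so they stay NaN.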
I want to split the given date column into date, month and year columns. Initially there are 2 columns; after the conversion there would be 4 columns, like
Country|Date|Month|Year
The given data frame is of the type
test=pd.DataFrame({'Date':['2014,1,1','2014,4,17'],'Country':['Denmark','Australia']})
Pandas has a to_datetime function.
import pandas as pd
df = pd.DataFrame({'Date':['2014,1,1','2014,4,17']})
df["Date"] = pd.to_datetime(df["Date"], format="%Y,%m,%d")
# If you want to save other datetime attributes as their own columns,
# just pull them out and assign them to their own columns
# df["Month"] = df["Date"].dt.month
# df["Year"] = df["Date"].dt.year
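Putting it together for the four-column layout from the question, a minimal sketch could look like the following. It assumes that "Date" in the desired output means the day of the month, which the question does not state explicitly; `parsed` and `result` are hypothetical names.

```python
import pandas as pd

test = pd.DataFrame({"Date": ["2014,1,1", "2014,4,17"],
                     "Country": ["Denmark", "Australia"]})

# Parse the comma-separated strings into real datetimes
parsed = pd.to_datetime(test["Date"], format="%Y,%m,%d")

# Build the Country|Date|Month|Year layout (Date = day of month, by assumption)
result = pd.DataFrame({
    "Country": test["Country"],
    "Date": parsed.dt.day,
    "Month": parsed.dt.month,
    "Year": parsed.dt.year,
})

print(result)
```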