Filtering yfinance data - python-3.x

I am trying to test stock algorithms using historical data, and I want to be able to select a date range, or even just a single date. But I keep getting empty DataFrames. What am I not doing right?
All I want it to do is select that day's market data.
Here is the relevant code:
import datetime
from pandas_datareader import data as pdr

def getdata(symbol, end_date, days):
    start_date = end_date - datetime.timedelta(days=days)
    return pdr.get_data_yahoo(symbol, start=start_date, end=end_date)

today = datetime.date.today()
date = today - datetime.timedelta(days=3)
df = getdata("MLM", today, 375)
print(date)
df2 = df.loc[df.index == date]
print(df2)
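A likely cause, offered as a hedged guess rather than a confirmed diagnosis: get_data_yahoo returns a DataFrame indexed by pandas Timestamps, so comparing the index against a plain datetime.date matches nothing. Converting the date first should help:
import pandas as pd

# Compare like with like: the index holds Timestamps, not datetime.date objects.
df2 = df.loc[df.index == pd.Timestamp(date)]
print(df2)
Also note that if the date falls on a weekend or market holiday there is no row to match, so the frame is legitimately empty; picking the nearest trading day avoids that.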

Related

Pandas Dataframe Query - Location of highest value per row

The following code generates a small DataFrame that is intended to be a fictitious Olympics medal table.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0, 47, 20).reshape(4, 5),
                  index=['USA', 'USR', 'ITL', 'GBR'],
                  columns=[1996, 2000, 2004, 2008, 2012])
df['Highest'] = df.max(axis=1).round()
df = df.sort_values('Highest', ascending=False).head(10)
df
I have added a column at the end to establish the highest medal tally per row (Country).
I need to add an additional 'Year' column that holds the year in which the highest medal tally was won for each row.
So, if the highest number of medals in row 1 was won in the year 2012, the value 2012 should appear in row 1 of the new 'Year' column.
How can I do that?
Thanks
Here's one option to find the index location, then find the Year. You can adapt it for your purpose as needed. Create the random df first.
.loc with the boolean mask returns the matching rows, and .index gives their labels; since the max occurs at a single element here, use [0] to get the value from the list.
Then use .at to get the year at the max value.
df = pd.DataFrame(data={'Year': range(2000, 2010),
                        'Value': np.random.uniform(low=0.5, high=13.3, size=(10,))},
                  columns=['Year', 'Value'])
max_value = df.Value.max()
idx_max_value = df.loc[df.Value == max_value].index[0]
year_at_max_value = df.at[idx_max_value, 'Year']
Probably not the most Pythonic solution, but this works:
year = []
for x in range(len(df)):
    pip = np.array(df.iloc[x, :5])  # the five year columns
    i = np.argmax(pip)
    year.append(df.columns[i])
df['Year'] = year
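For what it's worth, a more idiomatic alternative (a sketch assuming the five year columns come first, as in the sample frame): idxmax(axis=1) returns the column label of each row's maximum directly.
df['Year'] = df.iloc[:, :5].idxmax(axis=1)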

How to get data in a pandas dataframe in a date range

I have a front end where my clients select a date period like
date_start = 2020/01/03
date_end = 2020/03/10
I have a DataFrame that has 1975 rows and 4 columns, including Date, like:
Date|Tax|Values|Total
I need to get all rows whose Date falls between date_start and date_end in the Pandas DataFrame. How can I do that?
What I tried:
new_df = df[(df['Date'] >= date_start) & (df['Date'] <= date_end)]
But the result was wrong.
Welcome.
Keep in mind you're not filtering for those exact dates but selecting the dates in between.
Try the following:
# To make sure your column is in datetime format
df['Date'] = pd.to_datetime(df['Date'])
new_df = df.loc[(df['Date']>=date_start) & (df['Date']<=date_end)]
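One more hedged note: the bounds in the question are strings like 2020/01/03, so parsing those as well keeps the comparison datetime-to-datetime rather than datetime-to-string, which is a common reason the result looks wrong.
date_start = pd.to_datetime('2020/01/03')
date_end = pd.to_datetime('2020/03/10')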
Example:
from_date = "2021-08-27"
to_date = "2021-08-31"
# inclusive can be "both", "neither", "left", or "right"
df[df['date'].between(from_date, to_date, inclusive='both')]
With inclusive='both', this is equivalent to:
df[(from_date <= df['date']) & (df['date'] <= to_date)]

How to replace the null-valued column with the values stored in a list at the corresponding indexes in a CSV using python and pandas?

Check if the value of the cell in some specific row of CLOSE DATE is blank; if so, apply the formula of adding 3 days to the SOLVED DATE and update the value of the cell.
I'm using the pandas library and Jupyter Notebook as my editor.
d is the DataFrame read from my csv file:
for index, row in d.iterrows():
    startdate = row["SOLVED DATE"]
    print(index, startdate)
    enddate = pd.to_datetime(startdate) + pd.DateOffset(days=3)
    row["CLOSE DATE"] = enddate
    # d.iloc[index, 10] = enddate
    l1.append(enddate)
l1 is the list that contains the values in datetime format, and I need to replace the values of the column named "CLOSE DATE" with the values of l1 and update my csv file accordingly.
Welcome to the Stack Overflow community!
iterrows() is usually slow and should be avoided in most cases. There are a few ways to do your task:
1. Make two DataFrames, a null DF and a not-null DF, impute values in the null DF, then merge the two.
2. Impute values in the null DF itself.
As a supplement, the logic for adding the updated date column is as follows.
First take the "SOLVED DATE" and store it in a new series; call it "new_date".
Modify "new_date" by adding 3 days.
Once done, set "new_date" as the value of the column you want updated.
In terms of code:
# 1st Method
import pandas as pd

null = d.loc[d['CLOSE DATE'].isna()].copy()       # rows with a blank CLOSE DATE
not_null = d.loc[d['CLOSE DATE'].notna()].copy()
new_date = null['SOLVED DATE']
new_date = pd.to_datetime(new_date) + pd.DateOffset(days=3)
null['CLOSE DATE'] = new_date
d = pd.concat([null, not_null], axis=0)
d = d.reset_index(drop=True)
# 2nd Method
import pandas as pd

new_date = d.loc[d['CLOSE DATE'].isna(), 'SOLVED DATE']
new_date = pd.to_datetime(new_date) + pd.DateOffset(days=3)
d['CLOSE DATE'] = d['CLOSE DATE'].fillna(new_date)
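Since the question also asks to update the csv file, a final step would be writing the frame back out (the filename here is hypothetical; substitute your own path):
d.to_csv('tickets.csv', index=False)  # write the updated frame back to disk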

Regarding selecting certain rows based on given requirements into another dataframe

I have read the csv file into a DataFrame using Pandas; the csv format is as follows. I would like to put the rows whose time column falls between 6/3/2011 and 10/20/2011 into another DataFrame. How can I do it efficiently in Pandas?
Try this method:
data_frame['time'] = pd.to_datetime(data_frame['time'])
start_date = '2011-06-03'   # bounds taken from the question
end_date = '2011-10-20'
# use >= instead of > if the start date itself should be included
select_rows = (data_frame['time'] > start_date) & (data_frame['time'] <= end_date)
data_frame.loc[select_rows]
Or, you can make time column date time index and then select rows based on that as well.
I think you need to_datetime first and then filter by between with boolean indexing:
df['time'] = pd.to_datetime(df['time'], format='%m/%d/%Y')
df1 = df[df['time'].between('2011-06-03','2011-10-20')]
Create DatetimeIndex and select by loc:
df['time'] = pd.to_datetime(df['time'], format='%m/%d/%Y')
df = df.set_index('time')
df1 = df.loc['2011-06-03':'2011-10-20']
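One hedged caveat on this approach: slicing a DatetimeIndex with date strings like this assumes the index is sorted; on a non-monotonic index pandas can raise a KeyError, so sorting first is the safe move.
df = df.set_index('time').sort_index()
df1 = df.loc['2011-06-03':'2011-10-20']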

Generating sales insights automatically (weekly/daily)

I have sales data from Jan 2014 until last week, and the data refreshes every day.
I want to generate some insights automatically for the latest week, for example how much sales decreased/increased from last week to this week, which is the hot product, and so on.
I am confused about how to select the latest week dynamically.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Product': ['EyeWear', 'Packs', 'Watches', 'Irons', 'Glasses'],
    'Country': ['USA', 'India', 'Africa', 'UK', 'India'],
    'Revenue': [98, 90, 87, 69, 78],
    'Date': ['20140101', '20140102', '20140103', '20140104', '20140105']},
    index=[1, 2, 3, 4, 5])
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['week'] = df['Date'].dt.isocalendar().week  # .dt.week was removed in pandas 2.0
df['YearMonth'] = df['Date'].dt.strftime('%Y%m')
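Since the "latest week" question went unanswered above, here is a minimal sketch, assuming 'Date' is already parsed to datetime and 'Revenue' is the metric to compare (both names come from the sample frame):
# Key each row by its ISO year and week so week boundaries are handled for us.
iso = df['Date'].dt.isocalendar()

latest = df['Date'].max()
this_year, this_week_no, _ = latest.isocalendar()
prev_year, prev_week_no, _ = (latest - pd.Timedelta(weeks=1)).isocalendar()

this_week = df[(iso['year'] == this_year) & (iso['week'] == this_week_no)]
last_week = df[(iso['year'] == prev_year) & (iso['week'] == prev_week_no)]

# Week-over-week revenue change; an empty frame simply sums to 0.
change = this_week['Revenue'].sum() - last_week['Revenue'].sum()

# "Hot product": the product with the highest revenue in the latest week.
if not this_week.empty:
    hot_product = this_week.groupby('Product')['Revenue'].sum().idxmax()
Because the data refreshes daily, recomputing these from df['Date'].max() keeps the "latest week" dynamic without storing it anywhere.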
