How do you pull weekly/monthly historical data from yahoo finance? - python-3.x

By default, this code obtains all the daily closing data for several tickers:
import pandas as pd
from pandas_datareader import data as wb

tickers = ['SPY', 'QQQ', 'GLD', 'EEM', 'IEMG', 'VTI', 'HYG', 'SJNK', 'USO']
ind_data = pd.DataFrame()
for t in tickers:
    ind_data[t] = wb.DataReader(t, data_source='yahoo', start='2015-1-1')['Adj Close']
ind_data.to_excel('C:/Users/micka/Desktop/ETF.xlsx')
How do you add a parameter to DataReader in order to obtain weekly/monthly historical data instead? I tried using freq and interval but it doesn't work.

What if you try to replace this in your code for weekly data:
# Don't forget to import pandas_datareader exactly in this way
import pandas_datareader
# Then replace the DataReader line inside the for loop with this
pandas_datareader.yahoo.daily.YahooDailyReader(t, interval='w', start='2015-1-1').read()['Adj Close']
And this for monthly data:
# Don't forget to import pandas_datareader exactly in this way
import pandas_datareader
# Then replace the DataReader line inside the for loop with this
pandas_datareader.yahoo.daily.YahooDailyReader(t, interval='m', start='2015-1-1').read()['Adj Close']
Check the official documentation for further options. After executing this code for the weekly period:
import pandas as pd
import pandas_datareader

tickers = ['SPY', 'QQQ', 'GLD', 'EEM', 'IEMG', 'VTI', 'HYG', 'SJNK', 'USO']
ind_data = pd.DataFrame()
for t in tickers:
    ind_data[t] = pandas_datareader.yahoo.daily.YahooDailyReader(t, interval='w', start='2015-1-1').read()['Adj Close']
ind_data
I got this:
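If the interval argument ever misbehaves, another option (not part of the answer above, just a sketch built on the original daily download) is to resample the daily Adj Close series with pandas:
import pandas as pd
from pandas_datareader import data as wb

# Download the daily series once, then resample it locally.
daily = wb.DataReader('SPY', data_source='yahoo', start='2015-1-1')['Adj Close']

weekly = daily.resample('W').last()    # last adjusted close of each week
monthly = daily.resample('M').last()   # last adjusted close of each month
print(monthly.head())
Here resample('W') labels each bucket by its week-end date and resample('M') by its month-end date.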

Related

Issue webscraping a linked header with Beautiful Soup

I am running into an issue pulling in the human-readable header name from this table in an HTML document. I can pull in the id, but my trouble comes when trying to pull in the correct header text between the tags. I am not sure what I need to do in this instance. Below is my code. It all runs except for the last for loop.
# Import libraries
import requests
from bs4 import BeautifulSoup
from pprint import pprint
import pandas as pd
import numpy as np
# Pull the HTML link into a local file or buffer
# and then parse with the BeautifulSoup library
# ------------------------------------------------
url = 'https://web.dsa.missouri.edu/static/mirror_sites/factfinder.census.gov/bkmk/table/1.0/en/GEP/2014/00A4/0100000US.html'
r = requests.get(url)
#print('Status: ' + str(r.status_code))
#print(requests.status_codes._codes[200])
soup = BeautifulSoup(r.content, "html")
table = soup.find(id='data')
#print(table)
# Convert the data into a list of dictionaries
# or some other structure you can convert into
# pandas Data Frame
# ------------------------------------------------
trs = table.find_all('tr')
#print(trs)
header_row = trs[0]
#print(header_row)
names = []
for column in header_row.find_all('th'):
    names.append(column.attrs['id'])
#print(names)

db_names = []
for column in header_row.find_all('a'):
    db_names.append(column.attrs['data-vo-id'])  # ISSUE ARISES HERE!!!
print(db_names)
Let pandas read_html do the work for you, and simply specify the table id to find:
from pandas import read_html as rh
table = rh('https://web.dsa.missouri.edu/static/mirror_sites/factfinder.census.gov/bkmk/table/1.0/en/GEP/2014/00A4/0100000US.html', attrs = {'id': 'data'})[0]
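As a quick follow-up check (plain pandas, nothing specific to this page), you can inspect what read_html recovered as the headers:
# The parsed header cells become the DataFrame's columns
print(table.columns.tolist())
print(table.head())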
Hey, you can try something like this:
soup = BeautifulSoup(r.content, "html")
table = soup.findAll('table', {'id':'data'})
trs = table[0].find_all('tr')
#print(trs)
names = []
for row in trs[:1]:
    td = row.find_all('td')
    data_row_txt_list = [td_tag.text.strip() for td_tag in row]
    header_row = data_row_txt_list
    for column in header_row:
        names.append(column)
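If what you ultimately want is the visible header text rather than the id attribute, a small tweak to the asker's own loop (same header_row variable as above) should also work:
names = []
for column in header_row.find_all('th'):
    # get_text(strip=True) returns the human-readable text inside each header cell
    names.append(column.get_text(strip=True))
print(names)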

'For' loop for reading multiple csv files from a google storage bucket into 1 Pandas DataFrame

I currently have 31 .csv files (all with the same structure: 60 columns wide and about 5,000 rows deep) that I'm trying to read from a Google Storage bucket into one pandas DataFrame using a for loop, and I keep getting a 'timeout' error after 6 minutes.
Upon doing some testing, I have noticed that I'm able to read one .csv file at a time, but once I introduce two or more, I get the timeout error. This makes me think that my code is the problem rather than the size of the data.
My code is below (should I be using pd.concat at any stage in the for loop?). Any help would be appreciated.
def stage1eposdata(data, context):
    from google.cloud import storage
    from google.cloud import bigquery
    import pandas as pd
    import dask.dataframe as dd
    import io
    import numpy as np
    import datetime as dt
    from googleapiclient import discovery
    from pandas.io.json import json_normalize
    import google.auth
    import math

    destination_path1 = 'gs://staged_data/ddf-*_stet.csv'

    ## Source Buckets #
    raw_epos_bucket = 'raw_data'
    cleaned_epos_bucket = 'staged_data'

    # Confirming Oauth #
    storage_client = storage.Client()
    bigquery_client = bigquery.Client()

    # Confirming Connection #
    raw_epos_data = storage_client.bucket(raw_epos_bucket)
    cleaned_epos_data = storage_client.bucket(cleaned_epos_bucket)

    df = pd.DataFrame()
    for file in list(raw_epos_data.list_blobs(prefix='2019/')):
        file_path = "gs://{}/{}".format(file.bucket.name, file.name)
        df = df.append(pd.read_csv(file_path), sort=False)

    ddf = dd.from_pandas(df, npartitions=1, sort=True)
    ddf.to_csv(destination_path1, index=True, sep=',')
Try this:
## Source Buckets #
raw_epos_bucket = 'raw_data'
cleaned_epos_bucket = 'staged_data'
# Confirming Oauth #
storage_client = storage.Client()
bigquery_client = bigquery.Client()
# Confirming Connection #
raw_epos_data = storage_client.bucket(raw_epos_bucket)
cleaned_epos_data = storage_client.bucket(cleaned_epos_bucket)
my_dataframe_list=[]
for file in list(raw_epos_data.list_blobs(prefix='2019/')):
    file_path = "gs://{}/{}".format(file.bucket.name, file.name)
    my_dataframe_list.append(pd.read_csv(file_path))

df = pd.concat(my_dataframe_list)
ddf = dd.from_pandas(df, npartitions=1, sort=True)
ddf.to_csv(destination_path1, index=True, sep=',')
pd.concat joins a list of DataFrames. In each iteration of the loop you append the DataFrame to the list my_dataframe_list, and outside the loop you concatenate the list.
If the columns match, it should work.
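As a minimal illustration of that pattern with toy data (not the actual bucket files):
import pandas as pd

frames = []
for year in (2018, 2019):
    # in the real code this would be pd.read_csv(file_path)
    frames.append(pd.DataFrame({'year': [year, year], 'sales': [1, 2]}))

combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined)  # one DataFrame containing the rows of both pieces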
It turns out that dask can do this type of thing very well due to its 'lazy' computation feature. My solution is below:
## Source Buckets #
raw_epos_bucket = 'raw_data'
cleaned_epos_bucket = 'staged_data'
# Confirming Oauth #
storage_client = storage.Client()
bigquery_client = bigquery.Client()
# Confirming Connection #
raw_epos_data = storage_client.bucket(raw_epos_bucket)
cleaned_epos_data = storage_client.bucket(cleaned_epos_bucket)
# '*' is a wildcard, so there is no need for any more 'for' loops
ddf = dd.read_csv('gs://raw_data/*.csv')
ddf.to_csv(destination_path1, index=True, sep=',')
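For context on the 'lazy' part, here is a short sketch reusing the same bucket paths (gcsfs must be installed for gs:// URLs to work with dask):
import dask.dataframe as dd

# Nothing is downloaded yet: dd.read_csv only records the matching files and builds a task graph.
ddf = dd.read_csv('gs://raw_data/*.csv')

# Work happens only when you materialise a result or write output.
print(ddf.head())  # reads just enough data to show a preview
ddf.to_csv('gs://staged_data/ddf-*_stet.csv', index=True)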

insert using pandas to_sql() missing data into clickhouse db

It's my first time using sqlalchemy and pandas to insert some data into a clickhouse db.
When I insert some data using the clickhouse CLI it works fine, but when I try to do the same thing using sqlalchemy, one row is missing and I don't know why.
Have I done something wrong?
import pandas as pd
from sqlalchemy import create_engine, MetaData
# make_session comes from the ClickHouse SQLAlchemy dialect package in use

# created the dataframe
engine = create_engine(uri)
session = make_session(engine)
metadata = MetaData(bind=engine)
metadata.reflect(bind=engine)
conn = engine.connect()
df.to_sql('test', conn, if_exists='append', index=False)
Let's try this way:
import pandas as pd
from infi.clickhouse_orm.engines import Memory
from infi.clickhouse_orm.fields import UInt16Field, StringField
from infi.clickhouse_orm.models import Model
from sqlalchemy import create_engine

# define the ClickHouse table schema
class Test_Humans(Model):
    year = UInt16Field()
    first_name = StringField()

    engine = Memory()

engine = create_engine('clickhouse://default:@localhost/test')

# create table
with engine.connect() as conn:
    conn.connection.create_table(Test_Humans)  # https://github.com/Infinidat/infi.clickhouse_orm/blob/master/src/infi/clickhouse_orm/database.py#L142

pdf = pd.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
    # ! sqlalchemy-clickhouse ignores the last item so add a fake one
    {}
])

pdf.to_sql('test_humans', engine, if_exists='append', index=False)
Take into account that sqlalchemy-clickhouse ignores the last item, so add a fake one (see the source code and related issue 10).
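To sanity-check the workaround, a hedged snippet reusing the same engine and table name (on SQLAlchemy 1.4+ you may need to wrap the query in sqlalchemy.text()):
with engine.connect() as conn:
    rows = conn.execute('SELECT count() FROM test_humans').fetchall()
    print(rows)  # expect 4 real rows if the trailing fake item is ignored as described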

Dash python plotly live update table

I am new to plotly dash. I want to draw a table whose values (rows) will automatically be updated after a certain interval of time, but I do not know how to use dash table experiments. The table is already saved as a CSV file, but I am somehow unable to make it live.
Please help!
Can someone guide me in the right direction? What should I do?
Your help will be highly appreciated. Following is the code.
import dash
import pandas as pd
from pandas import Series, DataFrame
from dash.dependencies import Input, Output, Event
import dash_core_components as dcc
import dash_html_components as html
import dash_table_experiments as dtable

app = dash.Dash()

def TP_Sort():
    address = 'E:/Dats Science/POWER BI LAB DATA/PS CORE KPIS/Excel Sheets/Throughput.xlsx'
    TP = pd.read_excel(address)
    TP1 = TP.head()
    TP1.to_csv('TP1.csv', index=False)  # write the refreshed data for the callback to read back
    return TP1

app.layout = html.Div([
    html.H1('Data Throughput Dashboard-NOC NPM Core'),
    dcc.Interval(id='graph-update', interval=240000),
    dtable.DataTable(id='my-table',
                     rows=[{}],
                     row_selectable=False,
                     filterable=True,
                     sortable=False,
                     editable=False)
])

@app.callback(
    dash.dependencies.Output('my-table', 'row_update'),
    events=[dash.dependencies.Event('graph-update', 'interval')])
def update_table(maxrows=4):
    TP_Sort()
    TP_Table1 = 'C:/Users/muzamal.pervez/Desktop/Python Scripts/TP1.csv'
    TP_Table2 = pd.read_csv(TP_Table1)
    return TP_Table2.to_dict('records')

if __name__ == '__main__':
    app.run_server(debug=False)
I am trying the above approach. Please correct me where I am wrong, as the output is 'Error loading dependencies'.
BR
Rana
Your callback is wrong.
It should be:
@app.callback(Output('my-table', 'rows'), [Input('graph-update', 'n_intervals')])
def update_table(n, maxrows=4):
    # We're now in interval *n*
    # Your code
    return TP_Table2.to_dict('records')
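For completeness, a minimal end-to-end sketch of that pattern (assuming a dash version whose dcc.Interval exposes n_intervals, and that TP1.csv already exists next to the script; ids and the refresh interval are taken from the question):
import dash
import dash_core_components as dcc
import dash_html_components as html
import dash_table_experiments as dtable
import pandas as pd
from dash.dependencies import Input, Output

app = dash.Dash()

app.layout = html.Div([
    html.H1('Data Throughput Dashboard-NOC NPM Core'),
    # n_intervals increments every 240000 ms and drives the callback below
    dcc.Interval(id='graph-update', interval=240000, n_intervals=0),
    dtable.DataTable(id='my-table', rows=[{}]),
])

@app.callback(Output('my-table', 'rows'),
              [Input('graph-update', 'n_intervals')])
def update_table(n):
    # re-read the CSV on every tick and push the fresh rows into the table
    df = pd.read_csv('TP1.csv')
    return df.to_dict('records')

if __name__ == '__main__':
    app.run_server(debug=False)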

Scraping & parsing multiple html tables with the same URL (beautifulsoup & selenium)

I'm trying to scrape the full HTML table from this site:
https://www.iscc-system.org/certificates/all-certificates/
My code is as follows:
from selenium import webdriver
import time
import pandas as pd
url = 'https://www.iscc-system.org/certificates/all-certificates/'
browser = webdriver.Chrome('/home/giuseppe/bin/chromedriver')
browser.get(url)
csvfile = open('Scrape_certificates', 'a')
dfs = pd.read_html('https://www.iscc-system.org/certificates/all-certificates/', header=0)
for i in range(1,10):
    for df in dfs:
        df.to_csv(csvfile, header=False)
    link_next_page = browser.find_element_by_id('table_1_next')
    link_next_page.click()
    time.sleep(4)
    dfs = pd.read_html(browser.current_url)
csvfile.close()
The above code is only for the first 10 pages of the full table as an example.
The problem is that the output is always the first table repeated 10 times. Although clicking the 'next' button updates the table in the browser, I'm unable to get the new data from the following pages; I always get the same data from the first table.
Firstly, you are reading the URL with pandas, not the page source, so it fetches the page anew instead of reading the Selenium-generated source. Secondly, you want to limit the reading to the table with the id table_1. Try this:
from selenium import webdriver
import time
import pandas as pd
url = 'https://www.iscc-system.org/certificates/all-certificates/'
browser = webdriver.Chrome('/home/giuseppe/bin/chromedriver')
browser.get(url)
csvfile = open('Scrape_certificates', 'a')
for i in range(1,10):
    dfs = pd.read_html(browser.page_source, attrs={'id': 'table_1'})
    for df in dfs:
        df.to_csv(csvfile, header=False)
    link_next_page = browser.find_element_by_id('table_1_next')
    link_next_page.click()
    time.sleep(4)
csvfile.close()
You will need to remove or filter out line 10 from each result, as it is the navigation row.
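For example, assuming the navigation ends up as the last row of each parsed table, you could drop it before writing (a sketch, not verified against the live site):
for df in dfs:
    df = df.iloc[:-1]  # drop the trailing navigation row (assumption about where it lands)
    df.to_csv(csvfile, header=False)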
