Scraping values using Python - python-3.x

I want to scrape the "Short Interest" and "% of Float Shorted" values from https://www.marketwatch.com/investing/stock/aapl. I tried using XPath, but it raises an error:
import requests
from lxml import html
import pandas as pd
import numpy as np

url = "https://www.marketwatch.com/investing/stock/aapl"
page = requests.get(url)
tree = html.fromstring(page.content)
per_of_float_shorted = tree.xpath('/html/body/div[4]/div[5]/div[1]/div[1]/div/ul/li[14]/span[1]')[0].text()
short_interest = tree.xpath('/html/body/div[4]/div[5]/div[1]/div[1]/div/ul/li[14]/span[1]')[0].text()
print(short_interest)
print(per_of_float_shorted)
Traceback (most recent call last):
  File "scraper.py", line 11, in <module>
    per_of_float_shorted = tree.xpath('/html/body/div[4]/div[5]/div[1]/div[1]/div/ul/li[14]/span[1]')[0].text()
IndexError: list index out of range
Thank you.
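Two things are likely going wrong here: an absolute XPath like /html/body/div[4]/... breaks as soon as MarketWatch changes its layout (and may not match the HTML served to a script at all), and in lxml .text is an attribute, not a method. Below is a minimal sketch of a more robust approach that matches on the visible labels instead; the label strings and the User-Agent value are assumptions to check against the live page:

import requests
from lxml import html

url = "https://www.marketwatch.com/investing/stock/aapl"
# Many finance sites serve different (or empty) markup to clients without
# a browser User-Agent; the header value here is an illustrative assumption.
page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
tree = html.fromstring(page.content)

def value_for(label):
    # Look the value up by its visible label instead of an absolute path,
    # so the XPath survives layout changes.
    matches = tree.xpath(f'//li[contains(., "{label}")]/span[1]/text()')
    return matches[0].strip() if matches else None

print(value_for("Short Interest"))
print(value_for("% of Float Shorted"))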

Related

deleting pandas dataframe rows not working

import numpy as np
import pandas as pd

randArr = np.random.randint(0, 100, 20).reshape(5, 4)
df = pd.DataFrame(randArr, np.arange(101, 106, 1), ['PDS', 'Algo', 'SE', 'INS'])
df.drop('103', inplace=True)
This code is not working:
Traceback (most recent call last):
  File "D:\Education\4th year\1st sem\Machine Learning Lab\1st Lab\python\pandas\pdDataFrame.py", line 25, in <module>
    df.drop('103',inplace=True)
The string '103' isn't in the index, but the integer 103 is:
Replace df.drop('103', inplace=True) with df.drop(103, inplace=True).
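For reference, a corrected version of the snippet:

import numpy as np
import pandas as pd

randArr = np.random.randint(0, 100, 20).reshape(5, 4)
# np.arange(101, 106, 1) creates *integer* row labels 101..105
df = pd.DataFrame(randArr, np.arange(101, 106, 1), ['PDS', 'Algo', 'SE', 'INS'])
df.drop(103, inplace=True)  # drop by the integer label, not the string '103'
print(df.index)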

AttributeError: 'NoneType' object has no attribute 'text' (Google search results count)

In Python 3, I'm trying to get the search results count from Google for a random word, in this case 'test', with the following code:
import requests
from bs4 import BeautifulSoup
import argparse

parser = argparse.ArgumentParser(description='GoogleResultsCount')
parser.add_argument('text', help='text')
args = parser.parse_args()

r = requests.get('http://www.google.com/search',
                 params={'q': '"' + args.text + '"',
                         'tbs': 'li:1'})
soup = BeautifulSoup(r.text, "html.parser")
print(soup.find('div', {'id': 'resultStats'}).text)
But whenever I try to run it from cmd, it says the following:
C:\Users\Peter\Desktop>gsc.py test
Traceback (most recent call last):
  File "C:\Users\Peter\Desktop\GSC.py", line 32, in <module>
    print(soup.find('div',{'id':'resultStats'}).text)
AttributeError: 'NoneType' object has no attribute 'text'
What am I missing here?
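A likely cause: Google serves different markup to clients that don't look like a regular browser, so the resultStats div never appears in r.text and soup.find returns None. Here is a sketch that sends a browser-style User-Agent and guards against the element being missing; the header value is an assumption, and Google may still rename or drop the div at any time:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.google.com/search',
                 params={'q': '"test"', 'tbs': 'li:1'},
                 headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'html.parser')

stats = soup.find('div', {'id': 'resultStats'})
# Guard against a missing element instead of letting .text raise.
print(stats.text if stats else 'resultStats div not found')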

Iterate over all ids and crawl a table's elements into a DataFrame in Python

I want to click into each item on every page from this link, then crawl all the info on that page (marked with the red circle) and save it in DataFrame format.
In order to loop over all the pages, I need to iterate over all the Ids in the second link, but I don't know where I can get them.
https://www.pudong.gov.cn/shpd/InfoOpen/InfoDetail.aspx?Id=956244
I started with the following code, but I get a "No tables found" error, even though I can see the table in the page's source code:
import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
url = 'https://www.pudong.gov.cn/shpd/InfoOpen/InfoDetail.aspx?Id=1044647'
website_url = requests.get(url, verify=False).text
soup = BeautifulSoup(website_url, 'html.parser')
table = soup.find('table', {'class': 'ewb-claim table'})
# https://stackoverflow.com/questions/51090632/python-excel-export
df = pd.read_html(str(table))[0]
print(df.head(5))
Output:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
Traceback (most recent call last):
  File "<ipython-input-13-d95369a5d235>", line 12, in <module>
    df = pd.read_html(str(table))[0]
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/html.py", line 915, in read_html
    keep_default_na=keep_default_na)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/html.py", line 749, in _parse
    raise_with_traceback(retained)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/compat/__init__.py", line 385, in raise_with_traceback
    raise exc.with_traceback(traceback)
ValueError: No tables found
Could someone help? Thanks.
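Before settling on a selector, it's worth confirming the table is present at all in the HTML that requests receives; if the page fills it in with JavaScript, neither BeautifulSoup nor pd.read_html will ever see it. A small diagnostic sketch (the User-Agent value is an assumption):

import requests
from bs4 import BeautifulSoup

url = 'https://www.pudong.gov.cn/shpd/InfoOpen/InfoDetail.aspx?Id=1044647'
resp = requests.get(url, verify=False, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'html.parser')

# Print the class of every table in the fetched HTML, to check whether
# 'ewb-claim table' is there before calling pd.read_html on it.
for t in soup.find_all('table'):
    print(t.get('class'))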

AMZN stock data retrieval with YahooDailyReader

Up until 30 minutes ago I was executing the following code without problems:
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2012,1,1)
end = datetime.datetime(2015,12,31)
AAPL = web.get_data_yahoo('AAPL', start, end)
AMZN = web.get_data_yahoo('AMZN', start, end)
Instead now I get:
Traceback (most recent call last):
  File "/Users/me/opt/anaconda3/lib/python3.7/site-packages/pandas_datareader/yahoo/daily.py", line 157, in _read_one_data
    data = j["context"]["dispatcher"]["stores"]["HistoricalPriceStore"]
KeyError: 'HistoricalPriceStore'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/me/opt/anaconda3/lib/python3.7/site-packages/pandas_datareader/data.py", line 82, in get_data_yahoo
    return YahooDailyReader(*args, **kwargs).read()
  File "/Users/me/opt/anaconda3/lib/python3.7/site-packages/pandas_datareader/base.py", line 251, in read
    df = self._read_one_data(self.url, params=self._get_params(self.symbols))
  File "/Users/me/opt/anaconda3/lib/python3.7/site-packages/pandas_datareader/yahoo/daily.py", line 160, in _read_one_data
    raise RemoteDataError(msg.format(symbol, self.__class__.__name__))
pandas_datareader._utils.RemoteDataError: No data fetched for symbol AMZN using YahooDailyReader
How can I fix this?
Is there a workaround to get the AMZN data as a DataFrame from another source (different from Yahoo_Daily_Reader)?
Python version 3.4.7
How about this solution?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.optimize as sco
import datetime as dt
import math
from datetime import datetime, timedelta
from pandas_datareader import data as wb
from sklearn.cluster import KMeans

np.random.seed(777)

start = '2019-4-30'
end = '2019-10-31'
# N = 90
# start = datetime.now() - timedelta(days=N)
# end = dt.datetime.today()

tickers = ['MMM', 'ABT', 'AAPL', 'AMAT', 'APTV', 'ADM', 'ARNC', 'AMZN']
thelen = len(tickers)

price_data = []
for ticker in tickers:
    prices = wb.DataReader(ticker, start=start, end=end, data_source='yahoo')[['Adj Close']]
    price_data.append(prices.assign(ticker=ticker)[['ticker', 'Adj Close']])

df = pd.concat(price_data)
df.dtypes
df.head()
df.shape

pd.set_option('display.max_columns', 500)

df = df.reset_index()
df = df.set_index('Date')
table = df.pivot(columns='ticker')
# By specifying col[1] in the list comprehension below
# you can select the stock names under the multi-level column
table.columns = [col[1] for col in table.columns]
table.head()

plt.figure(figsize=(14, 7))
for c in table.columns.values:
    plt.plot(table.index, table[c], lw=3, alpha=0.8, label=c)
plt.legend(loc='upper left', fontsize=12)
plt.ylabel('price in $')
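If Yahoo keeps failing through pandas_datareader, the separate yfinance package is a common fallback for the same data. A minimal sketch, assuming yfinance is installed:

import yfinance as yf

# Daily AMZN prices for the original date range; auto_adjust=False keeps
# the raw 'Adj Close' column that pandas_datareader used to return.
amzn = yf.download('AMZN', start='2012-01-01', end='2015-12-31',
                   auto_adjust=False)
print(amzn[['Adj Close']].head())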

surprise.dataset is not a package

When I try to execute this code
import surprise.dataset

file = 'ml-100k/u.data'
col = 'user item rating timestamp'
reader = surprise.dataset.Reader(line_format=col, sep='\t')
data = surprise.dataset.Dataset.load_from_file(file, reader=reader)
data.split(n_folds=5)
I get this error:
Traceback (most recent call last):
  File "/home/user/PycharmProjects/prueba2/surprise.py", line 1, in <module>
    import surprise.dataset
  File "/home/user/PycharmProjects/prueba2/surprise.py", line 1, in <module>
    import surprise.dataset
ModuleNotFoundError: No module named 'surprise.dataset'; 'surprise' is not a package
But surprise is installed. How can I use it?
If you want to import a submodule, instead of doing:
import surprise.dataset
do:
from surprise import dataset
Then, in the rest of the code, reference it as dataset instead of surprise.dataset (i.e. replace surprise.dataset.Reader with dataset.Reader). If you don't want to update the rest of your code, you can also just import the whole surprise module instead with:
import surprise
Note also that the traceback shows your own script is named surprise.py, so import surprise.dataset finds your script first and tries to treat it as a package; rename the file, or the installed surprise package can never be imported from that directory.
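For completeness, the common classes are also re-exported at the package top level, so the snippet can avoid the submodule entirely. A sketch, assuming the ml-100k file path from the question and a script name that does not shadow the package:

import surprise

file = 'ml-100k/u.data'
reader = surprise.Reader(line_format='user item rating timestamp', sep='\t')
data = surprise.Dataset.load_from_file(file, reader=reader)
# Note: data.split(n_folds=5) only exists in older releases; newer versions
# moved cross-validation to surprise.model_selection.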
