I was helped with the code below for a web scraper by one of the very helpful chaps on here; however, it has suddenly stopped returning results. It either returns an empty set() or nothing at all.
Does the below work for you? I need to know whether it's an issue with my IDE, as it makes no sense for the code to work one minute and give random results the next when no amendments were made to it.
from requests_html import HTMLSession
import requests


def get_source(url):
    try:
        session = HTMLSession()
        response = session.get(url)
        return response
    except requests.exceptions.RequestException as e:
        print(e)


def scrape_google(query, start):
    response = get_source(f"https://www.google.co.uk/search?q={query}&start={start}")
    links = list(response.html.absolute_links)
    google_domains = ('https://www.google.',
                      'https://google.',
                      'https://webcache.googleusercontent.',
                      'http://webcache.googleusercontent.',
                      'https://policies.google.',
                      'https://support.google.',
                      'https://maps.google.')
    # Iterate over a copy so items can be removed from the original list
    for url in links[:]:
        if url.startswith(google_domains):
            links.remove(url)
    return links


data = []
for i in range(3):  # first three result pages, 10 results per page
    data.extend(scrape_google('best place', i * 10))
print(set(data))
A while back I created a BS4 script to scrape individual stock ticker market values off Yahoo Finance, the purpose being to update a personal finance program (individual use, not commercial).
The program worked flawlessly for months, but recently it stopped working 100% of the time; it now appears to have a 25-50% success rate. The errors the script does generate just say that a value was not obtained. I cannot figure out how to find out why a value wasn't found/scraped in one execution but was in another execution of the same script.
Each time I run the script it will work sometimes but not other times. I have adjusted the script to execute on a single user-input ticker instead of pulling a list from a database. Any thoughts as to where I am going wrong?
One attempt at debugging was the addition of print(soup), the idea being to ensure something was being obtained, which it appears to be. However, the soup.find_all() call seems to be the point of random success.
[As an aside, I may switch to an API in the future, but for educational and proof-of-concept purposes I want to get this to work.]
from bs4 import BeautifulSoup
import ssl
from urllib.request import Request, urlopen


def scrape_value(ticker):
    ticker_price = ""
    # For ignoring SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    print(f"ticker: {ticker} - before scrape")
    url = f'https://finance.yahoo.com/quote/{ticker.upper()}?p={ticker.upper()}&.tsrc=fin-srch'
    req = Request(url, headers={'User-Agent': 'Chrome/79.0.3945.130'})
    webpage = urlopen(req, context=ctx).read()
    soup = BeautifulSoup(webpage, 'html.parser')
    print(soup)
    for span in soup.find_all('span',
                              attrs={'class': "Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)"}):
        ticker_price = span.text.strip()
        print(ticker_price)
    return ticker_price


if __name__ == '__main__':
    scrape_value('F')
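A diagnostic sketch (not part of the original script; the filename is just an example) that could be placed inside scrape_value after the soup is built: when the expected <span> is missing, dump the fetched HTML to a file so a failing run can be compared with a successful one:

    spans = soup.find_all('span',
                          attrs={'class': "Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)"})
    if not spans:
        # Save the page Yahoo actually served (assumption: failing runs are
        # being served a different page, e.g. a consent or lite page)
        with open(f"{ticker}_failed.html", "w", encoding="utf-8") as f:
            f.write(str(soup))
        print(f"No price span found for {ticker} - HTML saved for inspection")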
Use a different approach to selecting the element. Observing the HTML source, the quote is in a <span> with data-reactid="32":
import requests
from bs4 import BeautifulSoup


def scrape_value(ticker):
    url = f'https://finance.yahoo.com/quote/{ticker.upper()}?p={ticker.upper()}&.tsrc=fin-srch'
    webpage = requests.get(url, headers={'User-Agent': 'Chrome/79.0.3945.130'}).text
    soup = BeautifulSoup(webpage, 'html.parser')
    ticker_price = soup.find('span', {'data-reactid': "32"}).text
    return ticker_price


if __name__ == '__main__':
    for ticker in ['F', 'AAPL', 'MSFT']:
        print('{} - {}'.format(ticker, scrape_value(ticker)))
Prints:
F - 6.90
AAPL - 369.44
MSFT - 201.57
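Note that soup.find() returns None when the element is missing, so .text raises AttributeError on a bad page. A defensive variant (a sketch, not part of the answer above; it reuses the same imports) surfaces the intermittent failures instead of crashing:

def scrape_value_safe(ticker):
    # Hypothetical variant: returns None instead of raising when Yahoo
    # serves a page without the expected <span>
    url = f'https://finance.yahoo.com/quote/{ticker.upper()}?p={ticker.upper()}&.tsrc=fin-srch'
    webpage = requests.get(url, headers={'User-Agent': 'Chrome/79.0.3945.130'}).text
    soup = BeautifulSoup(webpage, 'html.parser')
    span = soup.find('span', {'data-reactid': "32"})
    return span.text if span is not None else None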
This whole script worked fine the first 2-3 times, but now it constantly gets 503 responses. I checked my internet connection multiple times, and there wasn't any problem with it. Here is my code:
from bs4 import BeautifulSoup
import requests, sys, os, json


def get_amazon_search_page(search):
    # spaces are replaced with "+" to match the site's own search URLs
    search = search.strip().replace(" ", "+")
    for i in range(3):  # tries to connect and request the Amazon page 3 times
        try:
            print("Searching...")
            response = requests.get("https://www.amazon.in/s?k={}&ref=nb_sb_noss".format(search))
            print(response.status_code)
            if response.status_code == 200:
                return response.content, search
        except Exception:
            pass
    print("Is the search valid for the site: https://www.amazon.in/s?k={}&ref=nb_sb_noss".format(search))
    sys.exit(1)


def get_items_from_page(page_content):
    print(page_content)
    soup = BeautifulSoup(page_content, "html.parser")  # soup for extracting information
    items = soup.find_all("span", class_="a-size-medium a-color-base a-text-normal")
    prices = soup.find_all("span", class_="a-price-whole")
    item_list = []
    total_price_of_all = 0
    for item, price in zip(items, prices):
        item_dict = {}  # renamed so the dict built-in is not shadowed
        item_dict["Name"] = item.text
        item_dict["Price"] = int(price.text.replace(",", ""))
        total_price_of_all += int(price.text.replace(",", ""))
        item_list.append(item_dict)
    average_price = total_price_of_all / len(item_list)
    file = open("items.json", "w")
    json.dump(item_list, file, indent=4)
    print("Your search results are available in the items.json file")
    print("Average prices for the search: {}".format(average_price))
    file.close()


def main():
    os.system("clear")
    print("Note: sometimes the Amazon site misbehaves by sending 503 responses; this can be due to heavy traffic on the site, please cooperate\n\n")
    search = input("Enter product name: ").strip()
    page_content, search = get_amazon_search_page(search)  # the function returns a (content, search) tuple
    get_items_from_page(page_content)


if __name__ == "__main__":
    while True:
        main()
Please help!
The server is blocking you from scraping it.
If you check the robots.txt, you can see that the link you are trying to request is disallowed:
Disallow: */s?k=*&rh=n*p_*p_*p_
However, a simple way to get around this blocking is to change your User-Agent (see here). By default, requests sends something like "python-requests/2.22.0"; changing it to something more browser-like will work, at least temporarily.
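A minimal sketch of the fix (the header string is just an example browser UA, and Amazon may still block it at any time):

import requests

headers = {
    # Any reasonably current browser User-Agent string will do here
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/79.0.3945.130 Safari/537.36"
}
response = requests.get("https://www.amazon.in/s?k=laptop&ref=nb_sb_noss", headers=headers)
print(response.status_code)  # 200 instead of 503 while the workaround holds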
I have an application (written in PyQt5) that returns the x, y, and elevation of a location. When the user fills in x and y and hits the getz button, the app calls the function below:
def getz(self, i):
    """calculates the elevation"""
    import urllib
    url = "https://api.open-elevation.com/api/v1/lookup"
    x = self.lineEditX.text()
    y = self.lineEditY.text()
    url = url + "\?locations\={},{}".format(x, y)
    print(url)
    if i is "pushButtonSiteZ":
        response = urllib.request.Request(url)
        fp = urllib.request.urlopen(response)
        print('response is ' + response)
        self.lineEditSiteZ.setText(fp)
According to the Open-Elevation guide, you have to make requests in the form:
curl https://api.open-elevation.com/api/v1/lookup\?locations\=50.3354,10.4567
in order to get elevation data as a JSON object. But in my case it returns an error saying:
raise RemoteDisconnected("Remote end closed connection without"
RemoteDisconnected: Remote end closed connection without response
and nothing happens. How can I fix this?
There is no way around creating a loop that retries until the response is OK, because the Open-Elevation API's handling of many requests is still problematic. Note also that the backslashes in the curl command are shell escapes and must not appear in the URL itself. The following piece of code works, though possibly after a long delay:
def getz(self, i):
    """calculates the elevation"""
    import json
    import requests
    url = "https://api.open-elevation.com/api/v1/lookup"
    if i == 'pushButtonSiteZ':  # == rather than "is": identity checks on strings are unreliable
        x = self.lineEditSiteX.text()
        y = self.lineEditSiteY.text()
        param = url + '?locations={},{}'.format(x, y)
        print(param)
        while True:
            try:
                response = requests.get(param)
                print(response.status_code)
                if response.status_code == 200:
                    r = json.loads(response.text)
                    out = r['results'][0]['elevation']
                    print(out)
                    self.lineEditSiteZ.setText(str(out))
                    cal_rng(self)
                    break
            except ConnectionError:
                continue
            except json.decoder.JSONDecodeError:
                continue
            except KeyboardInterrupt:
                continue
            except requests.exceptions.SSLError:
                continue
            except requests.exceptions.ConnectionError:
                continue
First off, sorry... I am sure this is a common problem, but I did not find the solution anywhere even though I searched for a while.
I am trying to create a list by scraping data from classicdb. The two problems I have are:
The scraping as written in the try block does not work inside the for loop, though on its own it works. Currently it just returns 0 even though there should be values to return.
The output that I get from the try block generates new lists, but I want to just get the value and append it later.
I have tried the try block outside the for loop and there it worked.
I also saw some solutions where a while True was used, but that did not work for me.
from lxml.html import fromstring
import requests
import traceback
import time
from bs4 import BeautifulSoup as bs

Item_name = []
Sell_Copper = []
items = [47, 48]
url = 'https://classic.wowhead.com/item='
fails = []

for i in items:
    time.sleep(5)
    url1 = url + str(i)
    session = requests.session()
    response = session.get(url1)
    soup = bs(response.content, 'lxml')
    name = soup.select_one('h1').text
    print(name)
    # get the sell prices
    try:
        copper = soup.select_one('li:contains("Sells for") .moneycopper').text
    except Exception as e:
        copper = str(0)
The expected result would be that I get one value in copper and a list in Sell_Copper. In this case:
copper='1'
Sell_Copper=['1','1']
You don't need the sleep. The selector needs to be div:contains, and the search text needs changing:
import requests
from bs4 import BeautifulSoup as bs

Item_name = []
Sell_Copper = []
items = [47, 48]
url = 'https://classic.wowhead.com/item='
fails = []

with requests.Session() as s:
    for i in items:
        response = s.get(url + str(i))
        soup = bs(response.content, 'lxml')
        name = soup.select_one('h1').text
        print(name)
        try:
            copper = soup.select_one('div:contains("Sell Price") .moneycopper').text
        except Exception as e:
            copper = str(0)
        print(copper)
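To build the Sell_Copper list the question asks for (assuming that is the intent), append the scraped value inside the loop, at the same indentation as print(copper), and print the list afterwards:

        Sell_Copper.append(copper)

print(Sell_Copper)  # expected ['1', '1'] for items 47 and 48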
I want to check a URL for the existence of a robots.txt file. I found out about urllib.robotparser in Python 3 and tried getting the response, but I can't find a way to return the status code (or just True/False existence) of robots.txt.
from urllib import parse
from urllib import robotparser


def get_url_status_code():
    URL_BASE = 'https://google.com/'
    parser = robotparser.RobotFileParser()
    parser.set_url(parse.urljoin(URL_BASE, 'robots.txt'))
    parser.read()
    # I want to return the status code


print(get_url_status_code())
This isn't too hard to do if you're okay with using the requests module, which is highly recommended:
import requests


def status_code(url):
    r = requests.get(url)
    return r.status_code


print(status_code('https://github.com/robots.txt'))
print(status_code('https://doesnotexist.com/robots.txt'))
Otherwise, if you want to avoid using a GET request, you could use a HEAD request:
def does_url_exist(url):
    return requests.head(url).status_code < 400
Better yet,
def does_url_exist(url):
    try:
        r = requests.head(url)
        if r.status_code < 400:
            return True
        else:
            return False
    except requests.exceptions.RequestException as e:
        print(e)
        # handle your exception
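For example (the second URL is just a hypothetical missing path; note the function returns None when the request itself fails):

print(does_url_exist('https://github.com/robots.txt'))        # True
print(does_url_exist('https://github.com/no-such-file.txt'))  # False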