BS4 script not consistently scraping target value, not generating an error - python-3.x

A while back I created a BS4 script to scrape individual stock ticker market values from Yahoo Finance. The purpose is to update a personal finance program (individual use, not commercial).
The script worked flawlessly for months, but recently it stopped working reliably; it now appears to have a 25-50% success rate. The errors it does generate all stem from the fact that a value was not obtained. I cannot figure out how to generate an error explaining why a value was found on one execution but not on another execution of the same script.
In other words, each time I run the script it sometimes works and sometimes doesn't. I have adjusted the script to execute on a single user-input ticker instead of pulling a list from a database. Any thoughts as to where I am going wrong?
As an attempt at debugging I added print(soup) to make sure something was being fetched, which it appears to be. However, the soup.find_All() call seems to be the point of random success.
[As an aside, I may switch to an API in the future, but for educational and proof-of-concept purposes I want to get this working.]
from bs4 import BeautifulSoup
import ssl
import os
import time
from urllib.request import Request, urlopen


def scrape_value(ticker):
    ticker_price = ""

    # For ignoring SSL certificate errors
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    print(f"ticker: {ticker} - before scrape")
    url = f'https://finance.yahoo.com/quote/{ticker.upper()}?p={ticker.upper()}&.tsrc=fin-srch'
    req = Request(url, headers={'User-Agent': 'Chrome/79.0.3945.130'})
    webpage = urlopen(req).read()
    soup = BeautifulSoup(webpage, 'html.parser')
    print(soup)

    for span in soup.find_All('span',
                              attrs={'class': "Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)"}):
        ticker_price = span.text.strip()
        print(ticker_price)

    return ticker_price


if __name__ == '__main__':
    scrape_value('F')

Use a different approach to selecting the element. Observing the HTML source, the quote is in a <span> with data-reactid="32":
import requests
from bs4 import BeautifulSoup


def scrape_value(ticker):
    url = f'https://finance.yahoo.com/quote/{ticker.upper()}?p={ticker.upper()}&.tsrc=fin-srch'
    webpage = requests.get(url, headers={'User-Agent': 'Chrome/79.0.3945.130'}).text
    soup = BeautifulSoup(webpage, 'html.parser')
    ticker_price = soup.find('span', {'data-reactid': "32"}).text
    return ticker_price


if __name__ == '__main__':
    for ticker in ['F', 'AAPL', 'MSFT']:
        print('{} - {}'.format(ticker, scrape_value(ticker)))
Prints:
F - 6.90
AAPL - 369.44
MSFT - 201.57
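To address the part of the question about surfacing why a given run fails, one option is to check explicitly whether the element was found and, if not, raise an error that records the HTTP status and saves the fetched HTML for later inspection. A minimal diagnostic sketch along the lines of the answer above (the function name, saved file name, and selector reuse are illustrative assumptions, not part of the original code):

import requests
from bs4 import BeautifulSoup


def scrape_value_debug(ticker):
    url = f'https://finance.yahoo.com/quote/{ticker.upper()}?p={ticker.upper()}&.tsrc=fin-srch'
    response = requests.get(url, headers={'User-Agent': 'Chrome/79.0.3945.130'})
    soup = BeautifulSoup(response.text, 'html.parser')

    # Same selector as in the answer above; it may change if Yahoo alters its markup
    span = soup.find('span', {'data-reactid': "32"})
    if span is None:
        # Save the fetched page so the failing response can be inspected later
        with open(f'failed_{ticker}.html', 'w', encoding='utf-8') as f:
            f.write(response.text)
        raise ValueError(
            f'{ticker}: price element not found (HTTP {response.status_code}); '
            f'page saved to failed_{ticker}.html'
        )
    return span.text.strip()

Inspecting the HTML saved from a failing run should show whether those responses simply lack the expected markup.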

Related

Getting incorrect link on parsing web page in BeautifulSoup

I'm trying to get the download link from the button on this page, but when I open the download link that my code returns, I get a message page instead of the file.
I noticed that if I manually click the button and open the link in a new page, the csrfKey part of the link is always the same, whereas when I run the code I get a different key every time. Here's my code:
from bs4 import BeautifulSoup
import requests
import re


def GetPage(link):
    source_new = requests.get(link).text
    soup_new = BeautifulSoup(source_new, 'lxml')
    container_new = soup_new.find_all(class_='ipsButton')
    for data_new in container_new:
        # print(data_new)
        headline = data_new  # Display text
        match = re.findall('download', str(data_new), re.IGNORECASE)
        if match:
            print(f'{headline["href"]}\n')


if __name__ == '__main__':
    link = 'https://eci.gov.in/files/file/10985-5-number-and-types-of-constituencies/'
    GetPage(link)
Before you get to the actual download links of the files, you need to agree to the Terms and Conditions. So you need to fake this with requests and then parse the next page you get.
Here's how:
import requests
from bs4 import BeautifulSoup


if __name__ == '__main__':
    link = 'https://eci.gov.in/files/file/10985-5-number-and-types-of-constituencies/'

    with requests.Session() as connection:
        r = connection.get("https://eci.gov.in/")

        confirmation_url = BeautifulSoup(
            connection.get(link).text, 'lxml'
        ).select_one(".ipsApp .ipsButton_fullWidth")["href"]

        fake_agree_to_continue = connection.get(
            confirmation_url.replace("?do=download", "?do=download&confirm=1")
        ).text

        download_links = [
            a["href"] for a in
            BeautifulSoup(
                fake_agree_to_continue, "lxml"
            ).select(".ipsApp .ipsButton_small")[1:]
        ]

        for download_link in download_links:
            response = connection.get(download_link)
            file_name = (
                response
                .headers["Content-Disposition"]
                .replace('"', "")
                .split(" - ")[-1]
            )
            print(f"Downloading: {file_name}")
            with open(file_name, "wb") as f:
                f.write(response.content)
This should output:
Downloading: Number And Types Of Constituencies.pdf
Downloading: Number And Types Of Constituencies.xls
And it saves two files: a .pdf and an .xls.
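A side note on the filename handling: splitting the Content-Disposition header on " - " is tied to how this particular site formats the header. A slightly more general sketch (the helper name, regex, and fallback name are assumptions added here, not part of the answer) would read the filename= field directly:

import re


def filename_from_content_disposition(header, fallback="download.bin"):
    # Look for filename="..." (or unquoted filename=...) in the header value
    match = re.search(r'filename="?([^";]+)"?', header)
    return match.group(1) if match else fallback


# Usage with the response object from the loop above; the extracted name
# may differ slightly from what the split-based version produces:
# file_name = filename_from_content_disposition(response.headers["Content-Disposition"])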

How to scrape multiple pages with requests in python

I recently started getting into web scraping and have managed OK, but now I'm stuck and can't find the answer or figure it out.
Here is my code for scraping and exporting info from a single page:
import requests

page = requests.get("https://www.example.com/page.aspx?sign=1")

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

# finds the right heading to grab
box = soup.find('h1').text
heading = box.split()[0]

# finds the right paragraph to grab
reading = soup.find_all('p')[0].text
print(heading, reading)

import csv
from datetime import datetime

# open a csv file with append, so old data will not be erased
with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow([heading, reading, datetime.now()])
The problem occurs when I try to scrape multiple pages at the same time.
They are all the same; only the pagination changes, e.g.
https://www.example.com/page.aspx?sign=1
https://www.example.com/page.aspx?sign=2
https://www.example.com/page.aspx?sign=3
https://www.example.com/page.aspx?sign=4 etc
Instead of writing the same code 20 times, how do I stick all the data in a tuple or an array and export it to CSV?
Many thanks in advance.
Just try it with a loop, requesting pages until no page is available (the request is not OK). Should be easy:
import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime

results = []
page_number = 1

while True:
    response = requests.get(f"https://www.example.com/page.aspx?sign={page_number}")
    if response.status_code != 200:
        break

    soup = BeautifulSoup(response.content, 'html.parser')

    # finds the right heading to grab
    box = soup.find('h1').text
    heading = box.split()[0]

    # finds the right paragraph to grab
    reading = soup.find_all('p')[0].text

    # store a list
    # results.append([heading, reading, datetime.now()])
    # ...or a tuple, your call
    results.append((heading, reading, datetime.now()))

    page_number = page_number + 1

with open('index.csv', 'a') as csv_file:
    writer = csv.writer(csv_file)
    for result in results:
        writer.writerow(result)
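Since the question mentions roughly 20 pages, an alternative sketch is to loop over a fixed page range and write each row as it is scraped, which also avoids an endless loop if the site keeps answering 200 for pages past the end (the range of 20 comes from the question; adjust as needed):

import requests
from bs4 import BeautifulSoup
import csv
from datetime import datetime

with open('index.csv', 'a', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for page_number in range(1, 21):  # 20 pages, as described in the question
        response = requests.get(f"https://www.example.com/page.aspx?sign={page_number}")
        if response.status_code != 200:
            break  # stop early if a page is missing
        soup = BeautifulSoup(response.content, 'html.parser')
        heading = soup.find('h1').text.split()[0]
        reading = soup.find_all('p')[0].text
        writer.writerow([heading, reading, datetime.now()])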

Trying to scrape multiple pages sequentially using loop to change url

First off, sorry... I am sure this is a common problem, but I did not find the solution anywhere even though I searched for a while.
I am trying to create a list by scraping data from classicdb. I have two problems.
The scraping as written in the try block does not work inside the for loop, but on its own it works. Currently it just returns 0 even though there should be values to return.
The output I get from the try block generates new lists, but I want to just get the value and append it later.
I have tried the try block outside the for loop and there it worked.
I also saw some solutions where a while True loop was used, but that did not work for me.
from lxml.html import fromstring
import requests
import traceback
import time
from bs4 import BeautifulSoup as bs

Item_name = []
Sell_Copper = []
items = [47, 48]
url = 'https://classic.wowhead.com/item='
fails = []

for i in items:
    time.sleep(5)
    url1 = url + str(i)
    session = requests.session()
    response = session.get(url1)
    soup = bs(response.content, 'lxml')
    name = soup.select_one('h1').text
    print(name)
    # get the buy prices
    try:
        copper = soup.select_one('li:contains("Sells for") .moneycopper').text
    except Exception as e:
        copper = str(0)
The expected result would be that I get one value in gold and a list in P_Gold. In this case:
copper = '1'
Sell_copper = ['1', '1']
You don't need the sleep. The selector needs to be div:contains and the search text needs changing:
import requests
from bs4 import BeautifulSoup as bs

Item_name = []
Sell_Copper = []
items = [47, 48]
url = 'https://classic.wowhead.com/item='
fails = []

with requests.Session() as s:
    for i in items:
        response = s.get(url + str(i))
        soup = bs(response.content, 'lxml')
        name = soup.select_one('h1').text
        print(name)
        try:
            copper = soup.select_one('div:contains("Sell Price") .moneycopper').text
        except Exception as e:
            copper = str(0)
        print(copper)
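Since the stated goal is to end up with a list of per-item values, you could also append inside the loop rather than only printing. A small sketch built on the answer above, reusing the list names from the question's own code:

import requests
from bs4 import BeautifulSoup as bs

Item_name = []
Sell_Copper = []
items = [47, 48]
url = 'https://classic.wowhead.com/item='

with requests.Session() as s:
    for i in items:
        soup = bs(s.get(url + str(i)).content, 'lxml')
        Item_name.append(soup.select_one('h1').text)
        try:
            # same div:contains("Sell Price") selector as in the answer above
            Sell_Copper.append(soup.select_one('div:contains("Sell Price") .moneycopper').text)
        except Exception:
            Sell_Copper.append('0')  # item has no sell price

print(Item_name)
print(Sell_Copper)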

Get the name of Instagram profile and the date of post with Python

I'm in the process of learning Python 3 and I'm trying to solve a simple task: I want to get the name of the account and the date of the post from an Instagram link.
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.instagram.com/p/BuPSnoTlvTR')
soup = BeautifulSoup(html.text, 'lxml')
item = soup.select_one("meta[property='og:description']")
name = item.find_previous_sibling().get("content").split("•")[0]
print(name)
This code works sometimes with links like this: https://www.instagram.com/kingtop
But I need it to also work with posts of images like this: https://www.instagram.com/p/BuxB00KFI-x/
That's all I could come up with, but it is not working, and I can't get the date either.
Do you have any ideas? I appreciate any help.
I found a way to get the name of the account. Now I'm trying to find a way to get the upload date:
import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.error
import time
from multiprocessing import Pool
from requests.exceptions import HTTPError

start = time.time()
file = open('users.txt', 'r', encoding="ISO-8859-1")
urls = file.readlines()

for url in urls:
    url = url.strip('\n')
    try:
        req = requests.get(url)
        req.raise_for_status()
    except HTTPError as http_err:
        output = open('output2.txt', 'a')
        output.write('not found\n')  # original message: 'не найдена'
    except Exception as err:
        output = open('output2.txt', 'a')
        output.write('not found\n')  # original message: 'не найдены'
    else:
        output = open('output2.txt', 'a')
        soup = BeautifulSoup(req.text, "lxml")
        the_url = soup.select("[rel='canonical']")[0]['href']
        the_url2 = the_url.replace('https://www.instagram.com/', '')
        head, sep, tail = the_url2.partition('/')
        output.write(head + '\n')
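For the upload date, one possible approach at the time this was asked was to parse the <script type="application/ld+json"> block that Instagram post pages embedded, which included an uploadDate field. This is a hedged sketch based on that assumption; Instagram changes its markup frequently and may require a login, so treat it as a starting point rather than a guaranteed solution:

import json
import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.instagram.com/p/BuPSnoTlvTR',
                    headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(html.text, 'lxml')

# Assumption: the post page embeds an application/ld+json block with an uploadDate key
ld_json = soup.find('script', type='application/ld+json')
if ld_json is not None:
    data = json.loads(ld_json.get_text())
    print(data.get('uploadDate'))  # date of the post, if the field is present
else:
    print('ld+json block not found; the markup may have changed or a login may be required')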

Get geolocation from webpage by scraping using Selenium

I am trying to scrape this page: https://www.coordonnees-gps.fr/
The goal is to collect the latitude and the longitude. However, I see that the HTML content doesn't change after any submit in the "Adresse" field, and I don't know if this is the cause of my empty lists.
My script:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.coordonnees-gps.fr/"
chrome_path = r"C:\Users\jbourbon\Desktop\chromedriver_win32 (1)\chromedriver.exe"

driver = webdriver.Chrome(chrome_path)
driver.maximize_window()
driver.get(url)

r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
g_data = soup.findAll("div", {"class": "col-md-9"})

latitude = []
longitude = []

for item in g_data:
    latitude.append(item.contents[0].findAll("input", {"id": "latitude"}))
    longitude.append(item.contents[0].findAll("input", {"id": "longitude"}))

print(latitude)
print(longitude)
And here is what I get for my lists: they come back empty.
Thanks :)
There's nothing wrong with what you're doing; the problem is that when you open the link using selenium or requests, the geolocation is not available instantly. It becomes available after a few seconds (aaf9ec0.js adds it to the HTML dynamically, so requests won't work anyway). Also, input#latitude doesn't seem to give the values, but you can get them from div#info_window.
I've modified the code a bit; it should get you the latitude and longitude. It works for me:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import re

url = "https://www.coordonnees-gps.fr/"

driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
time.sleep(2)  # wait for the geolocation to become ready

# no need to fetch the page again using requests, selenium already has it
# r = requests.get(url)

soup = BeautifulSoup(driver.page_source, "html.parser")
g_data = soup.findAll("div", {"id": "info_window"})[0]

# extract latitude, longitude
latitude, longitude = re.findall(
    r'Latitude :</strong> ([\.\d]*) \| <strong>Longitude :</strong> ([\.\d]*)<br/>',
    str(g_data)
)[0]

print(latitude)
print(longitude)
Output
25.594095
85.137565
My best guess is that you have to enable GeoLocation for the chromedriver instance. See this answer for a lead on how to do that.
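One common way to grant geolocation to the chromedriver instance is through Chrome profile preferences. The preference key below is the standard Chrome content-settings key rather than something taken from the answer above, so treat this as a sketch:

from selenium import webdriver

options = webdriver.ChromeOptions()
# 1 = allow, 2 = block; lets the page read the browser's geolocation without prompting
options.add_experimental_option("prefs", {
    "profile.default_content_setting_values.geolocation": 1,
})

# pass the chromedriver path as in the question if it is not on PATH
driver = webdriver.Chrome(options=options)
driver.get("https://www.coordonnees-gps.fr/")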
