Beautiful Soup find td by id, why isn't this working? - python-3.x

I'm trying to get the Real Estimate price, i.e. the 187.40, from this page:
https://www.marketscreener.com/MICROSOFT-CORPORATION-4835/?type_recherche=rapide&mots=MSFT
The price is in the following HTML element: td#zbjsfv_dr
So I have done the following using Beautiful Soup:
Comp = soup.find("td", id="zbjsfv_dr")
print(Comp)
But this isn't returning anything, and I don't understand why.
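One quick check (a debugging sketch, not a fix) is whether the id appears in the raw response at all:
import requests

# Debugging sketch: confirm the id is present in the raw HTML at all.
r = requests.get('https://www.marketscreener.com/MICROSOFT-CORPORATION-4835/?type_recherche=rapide&mots=MSFT')
print(r.status_code)          # anything other than 200 hints at blocking
print('zbjsfv_dr' in r.text)  # False means the value is likely injected by JavaScript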

I think there is something wrong with your request, because I can get the value of the td with id="zbjsfv_dr". You didn't share all of your code, so this is just an example:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.marketscreener.com/MICROSOFT-CORPORATION-4835/?type_recherche=rapide&mots=MSFT')
source = BeautifulSoup(r.content, 'html.parser')  # explicit, built-in parser
comp = source.find("td", id="zbjsfv_dr")
print(comp.text)
OUTPUT:
188.085
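If a request like this ever comes back blocked instead of parsed, sending a browser-like User-Agent header is a common workaround; a sketch, assuming the site cares about the header:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumption: the site may reject the default requests UA
r = requests.get('https://www.marketscreener.com/MICROSOFT-CORPORATION-4835/?type_recherche=rapide&mots=MSFT',
                 headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
comp = soup.find('td', id='zbjsfv_dr')
print(comp.text if comp else 'element not found')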

Related

Use beautifulsoup to download href links

Looking to download href links using beautifulsoup4, Python 3, and the requests library.
This is the code that I have now. I thought it would be tough to use regex in this situation, but I'm not sure if this can be done using BeautifulSoup instead. I have to download all of the shapefiles from the grid and am looking to automate this task. Thank you!
URL:
https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads
import requests
from bs4 import BeautifulSoup
import re
URL = 'https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads'
page = requests.get(URL)
soup = BeautifulSoup(page.content,'html.parser')
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
print(results)
Those files are all associated with an area tag, so I would simply select those:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads')
soup = bs(r.content, 'lxml')
files = ['https://earth-info.nga.mil/' + i['href'] for i in soup.select('area')]
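Since the stated goal is to download the shapefiles rather than just list them, a loop like this could follow; a sketch that assumes the last path segment of each link is a usable filename:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads')
soup = bs(r.content, 'lxml')
files = ['https://earth-info.nga.mil/' + i['href'] for i in soup.select('area')]

# Save each file locally; assumes the last path segment is a usable filename.
for url in files:
    name = url.rsplit('/', 1)[-1]
    with open(name, 'wb') as f:
        f.write(requests.get(url).content)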
You can convert page to a string in order to search for all a tags using regex; page is a Response object, which is why re.findall fails on it as-is.
Instead of:
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
Use:
results = re.findall(r'<a[^>]* href="([^"]*)"', page.text)
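Alternatively, since the question asks whether this can be done with BeautifulSoup at all, the same hrefs can be collected without any regex; a minimal sketch:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads')
soup = BeautifulSoup(page.content, 'html.parser')

# href=True keeps only anchors that actually have an href attribute.
results = [a['href'] for a in soup.find_all('a', href=True)]
print(results)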

BeautifulSoup prints None even though the content is there

I am trying to build a Hacker News scraper, but when I run my code:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://news.ycombinator.com/")
soup = BeautifulSoup(res.text,'html.parser')
print(soup.find(id="score_23174015"))
I don't get why BeautifulSoup is returning None all the time. I am still learning, so yeah, I am new to Python 3 as well.
I checked the URL, but there is no element with id="score_23174015"; the front page changes constantly, so story ids come and go.
Anyway, try this if you want to find an element by its attributes:
soup.find(attrs = {'id':"score_23167794"})
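A more robust approach than hard-coding one id is to collect every score on the front page, since story ids rotate constantly; a sketch based on the score_<item id> pattern in Hacker News markup:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://news.ycombinator.com/")
soup = BeautifulSoup(res.text, 'html.parser')

# Every score span has an id of the form "score_<item id>", so match the class instead.
for span in soup.find_all('span', class_='score'):
    print(span['id'], span.text)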

Results of soup.find are None despite the content existing

I'm trying to track the price of a product on Amazon using Python in a Jupyter notebook. I've imported bs4 and requests for this task.
When I inspect the HTML on the product page I can see <span id="productTitle" class="a-size-large">
However, when I try to search for it using soup.find(id="productTitle"), the result comes out as None.
I've tried soup.find with other ids and classes, but the results are still None.
This is my code to find the id:
title = soup.find(id="productTitle")
If I fix this, I hope to be able to get the name of the product whose price I will be tracking.
That info is stored in various places in the returned HTML. Have you checked your response to see that you are not blocked or getting an unexpected response?
I found it with that id using select_one and .strip():
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.amazon.com/dp/B00M4LWO8O/')
soup = bs(r.content, 'lxml')
print(soup.select_one('#productTitle').text.strip())
Also, you can pull the title from the product image's alt attribute:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.amazon.com/dp/B00M4LWO8O/')
soup = bs(r.content, 'lxml')
print(soup.select_one('#imgTagWrapperId img[alt]')['alt'])
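If either snippet returns None, Amazon is likely serving a bot-check page. Sending browser-like headers sometimes helps; a sketch, with the header values being an assumption rather than a guarantee:
import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0', 'Accept-Language': 'en-US,en;q=0.9'}
r = requests.get('https://www.amazon.com/dp/B00M4LWO8O/', headers=headers)
soup = bs(r.content, 'lxml')
title = soup.select_one('#productTitle')
# Guard against None so a blocked response fails visibly instead of with AttributeError.
print(title.text.strip() if title else 'blocked or layout changed')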

How can I get the links under a specific class

So two days ago I was trying to parse the data between two of the same classes, and Keyur helped me a lot, but then other problems cropped up. :D
Page link: https://www.hltv.org/matches
Now I want to get the links under a specific class. Here is my code, and here are the errors.
from bs4 import BeautifulSoup
import urllib.request
import datetime
headers = {}  # Headers give information about you, like your operating system, your browser, etc.
headers['User-Agent'] = 'Mozilla/5.0'  # I defined a user agent because HLTV perceives my connection as a bot.
hltv = urllib.request.Request('https://www.hltv.org/matches', headers=headers) # Basically connecting to website
session = urllib.request.urlopen(hltv)
sauce = session.read() # Getting the source of website
soup = BeautifulSoup(sauce, 'lxml')
a = 0
b = 1
# Getting the match pages' links.
for x in soup.find('span', text=datetime.date.today()).parent:
    print(x.find('a'))
Error:
Actually there isn't an error, but the output looks like this:
None
None
None
-1
None
None
-1
Then I researched and saw that when there is no data to give, the find function returns nothing, which is None.
Then I tried to use find_all.
Code:
print(x.find_all('a'))
Output:
AttributeError: 'NavigableString' object has no attribute 'find_all'
This is the class name:
<div class="standard-headline">2018-05-01</div>
I don't want to post all the code here, so here is the link hltv.org/matches/ so you can check the classes more easily.
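For reference, the AttributeError happens because iterating a tag yields plain text nodes as well as tags; a minimal sketch with throwaway HTML shows it:
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><span>a</span> text <span>b</span></div>', 'html.parser')
for child in soup.div:
    # Tag children support .find_all; the bare " text " node is a
    # NavigableString and does not, hence the AttributeError.
    print(type(child).__name__, repr(child))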
I'm not quite sure I understood which links the OP really wants to grab; however, I took a guess. The links are within the compound classes a-reset block upcoming-match standard-box, and if you can spot the right one then a single class will suffice to fetch the data, just as selectors do. Give it a shot.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.parse import urljoin
import datetime
url = 'https://www.hltv.org/matches'
req = Request(url, headers={"User-Agent":"Mozilla/5.0"})
res = urlopen(req).read()
soup = BeautifulSoup(res, 'lxml')
for links in soup.find(class_="standard-headline", text=(datetime.date.today())).find_parent().find_all(class_="upcoming-match")[:-2]:
    print(urljoin(url, links.get('href')))
Output:
https://www.hltv.org/matches/2322508/yeah-vs-sharks-ggbet-ascenso
https://www.hltv.org/matches/2322633/team-australia-vs-team-uk-showmatch-csgo
https://www.hltv.org/matches/2322638/sydney-saints-vs-control-fe-lil-suzi-winner-esl-womens-sydney-open-finals
https://www.hltv.org/matches/2322426/faze-vs-astralis-iem-sydney-2018
https://www.hltv.org/matches/2322601/max-vs-fierce-tiger-starseries-i-league-season-5-asian-qualifier
and so on.
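The same links can also be grabbed with a CSS selector using just one class from the compound set; a sketch, noting these class names reflect the 2018 markup and may have changed since:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://www.hltv.org/matches'
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'lxml')

# One class out of "a-reset block upcoming-match standard-box" is enough for select().
for a in soup.select('a.upcoming-match[href]'):
    print(urljoin(url, a['href']))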

Python - Issue Scraping with BeautifulSoup

I'm trying to scrape the Stack Overflow jobs page using Beautiful Soup 4 and urllib as a personal project. I'm trying to scrape all the links to the 50 jobs listed on each page, using a regex to identify those links. Even though I reference the tag properly, I am facing these two specific issues:
Instead of the 50 links clearly visible in the source code, I get only 25 results each time as my output (after accounting for and removing an initial irrelevant link).
There's a difference between how the links are ordered in the source code and in my output.
Here's my code. Any help on this will be greatly appreciated:
import bs4
import urllib.request
import re
#Obtaining source code to parse
sauce = urllib.request.urlopen('https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p&pg=0').read()
soup = bs4.BeautifulSoup(sauce, 'html.parser')
snippet = soup.find_all("script",type="application/ld+json")
strsnippet = str(snippet)
print(strsnippet)
joburls = re.findall('https://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strsnippet)
print("Urls: ",joburls)
print(len(joburls))
Disclaimer: I did some asking of my own for a part of this answer.
from bs4 import BeautifulSoup
import requests
import json
# note: link is slightly different; yours just redirects here
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]
print(len(urls))
Output:
50
Process:
Use soup.find rather than soup.find_all. This gives a single bs4.element.Tag whose text is JSON.
json.loads(s.text) is a nested dict. Access the value of the itemListElement key to get a list of job entries, then pull each url out into a list (see the sketch below).
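To make that access pattern concrete, here is a runnable sketch with a hypothetical payload that mirrors the structure of the real script tag:
import json

# Hypothetical JSON-LD payload mirroring the structure of the real script tag.
s_text = '''{
  "@type": "ItemList",
  "itemListElement": [
    {"@type": "ListItem", "position": 1, "url": "https://stackoverflow.com/jobs/1"},
    {"@type": "ListItem", "position": 2, "url": "https://stackoverflow.com/jobs/2"}
  ]
}'''
urls = [el['url'] for el in json.loads(s_text)['itemListElement']]
print(urls)  # ['https://stackoverflow.com/jobs/1', 'https://stackoverflow.com/jobs/2']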
