Can't scrape a value with BeautifulSoup in Python - python-3.x

I am creating a website where I display the current wind. When I go to https://www.windguru.cz/station/219 (and click "inspect element" on the max wind value) I can see this:
<span class="wgs_wind_max_value">12</span>
The 12 is the value I need, but when I try to scrape it with bs4 and requests, this appears as the output:
<span class="wgs_wind_max_value"></span>
As you can see, there is no '12' value.
Can someone help me with that?
from bs4 import BeautifulSoup
import requests

page = requests.get('https://www.windguru.cz/3323')
soup = BeautifulSoup(page.content, "lxml")
table = soup.find_all("span", {"class": "wgs_wind_max_value"})
print(table)

Use the same API the page itself calls to get the JSON that populates those values. Notice the querystring construction passed to the API.
import requests

# the API expects the station page as the Referer
headers = {'Referer': 'https://www.windguru.cz/station/219'}
r = requests.get('https://www.windguru.cz/int/iapi.php?q=station_data_current&id_station=219&date_format=Y-m-d%20H%3Ai%3As%20T&_mha=f4d18b6c', headers=headers).json()
print(r)
print(r['wind_max'])
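The same request can be written with requests building the querystring, which makes the parameters easier to read. A minimal sketch, assuming the _mha token copied from the URL above remains valid (it may be session-specific):
import requests

headers = {'Referer': 'https://www.windguru.cz/station/219'}
params = {
    'q': 'station_data_current',
    'id_station': 219,
    'date_format': 'Y-m-d H:i:s T',
    '_mha': 'f4d18b6c',  # token taken from the original URL; may need refreshing
}
r = requests.get('https://www.windguru.cz/int/iapi.php', params=params, headers=headers).json()
print(r['wind_max'])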

Related

How to filter on this artifact in the HTML?

I am using this code and it works:
from bs4 import BeautifulSoup
import sys
import requests
page = requests.get("https://finance.yahoo.com/quote/GOOGL?p=GOOGL")
soup = BeautifulSoup(page.content, 'html.parser')
fin_streamer = soup.find("fin-streamer", class_="Fz(36px)")
print(fin_streamer)
print(fin_streamer.get_text())
It prints this for fin_streamer:
<fin-streamer active="" class="Fw(b) Fz(36px) Mb(-4px) D(ib)" data-field="regularMarketPrice" data-pricehint="2" data-reactid="47" data-symbol="GOOGL" data-test="qsp-price" data-trend="none" value="2897.04">2,897.04</fin-streamer>
What I'd like to do is filter on something more useful than the Fz(36px) class, such as
data-symbol="GOOGL"
but I don't know the syntax for that.
How to select?
To select your element more specifically, simply use CSS selectors:
soup.select_one('fin-streamer[data-symbol="GOOGL"]')
The line above selects the first <fin-streamer> with the attribute data-symbol="GOOGL". To get its value, call ['value']; alternatively, call .text.
Note: the value attribute and the text are formatted differently.
Example
from bs4 import BeautifulSoup
import requests

page = requests.get("https://finance.yahoo.com/quote/GOOGL?p=GOOGL")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.select_one('fin-streamer[data-symbol="GOOGL"]')['value'])
You can use a dictionary:
fin_streamer = soup.find("fin-streamer", {"data-symbol":"GOOGL"})
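Either way, the two access paths return the same number in different formats, as noted above (a minimal sketch, assuming the markup shown in the question):
fin_streamer = soup.find("fin-streamer", {"data-symbol": "GOOGL"})
print(fin_streamer['value'])     # machine-readable attribute: 2897.04
print(fin_streamer.get_text())   # display text: 2,897.04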

Extract max page number from web page

I need to extract the max page number from a property listings web site.
My code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.nehnutelnosti.sk/vyhladavanie/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id = 'inzeraty')
sitenums = soup.find_all('ul', class_='component-pagination__items d-flex align-items-center')
sitenums.find_all('li', class_='component-pagination__item')
My code returns error:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Thanks for any help.
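As the error message suggests, find_all() returns a ResultSet (a list of elements), not a single element, so call find_all() on each element inside it. A minimal sketch of that fix, reusing the class names from the question:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.nehnutelnosti.sk/vyhladavanie/')
soup = BeautifulSoup(page.content, 'html.parser')
# iterate the ResultSet, then search inside each <ul>
for ul in soup.find_all('ul', class_='component-pagination__items d-flex align-items-center'):
    for li in ul.find_all('li', class_='component-pagination__item'):
        print(li.get_text(strip=True))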
You could use a CSS selector and grab the second value from the end.
Here's how:
import requests
from bs4 import BeautifulSoup

css = ".component-pagination .component-pagination__item a, .component-pagination .component-pagination__item span"
page = requests.get('https://www.nehnutelnosti.sk/vyhladavanie/')
# select all pagination entries and take the second one from the end (the max page number)
last_page = BeautifulSoup(page.content, 'html.parser').select(css)[-2]
print(last_page.getText(strip=True))
Output:
2309
A similar idea, but doing the filtering within the CSS selector rather than by indexing, using :nth-last-child():
The :nth-last-child() CSS pseudo-class matches elements based on their position among a group of siblings, counting from the end.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.nehnutelnosti.sk/vyhladavanie')
soup = bs(r.text, "lxml")
print(int(soup.select_one('.component-pagination__item:nth-last-child(2) a').text.strip()))

How can I scrape a <h1> tag using BeautifulSoup? [Python]

I am currently coding a price tracker for different websites, but I have run into an issue.
I'm trying to scrape the contents of an h1 tag using BeautifulSoup4, but I don't know how. I've tried to use a dictionary, as suggested in https://stackoverflow.com/a/40716482/14003061, but it returned None.
Can someone please help? It would be appreciated!
Here's the code:
from termcolor import colored
import requests
from bs4 import BeautifulSoup
import smtplib
def choice_bwfo():
    print(colored("You have selected Buy Whole Foods Online [BWFO]", "blue"))
    url = input(colored("\n[ 2 ] Paste a product link from BWFO.\n", "magenta"))
    url_verify = requests.get(url, headers=headers)
    soup = BeautifulSoup(url_verify.content, 'html5lib')
    item_block = BeautifulSoup.find('h1', {'itemprop' : 'name'})
    print(item_block)

choice_bwfo()
Here's an example URL you can use:
https://www.buywholefoodsonline.co.uk/organic-spanish-bee-pollen-250g.html
Thanks :)
This script will print the content of the <h1> tag:
import requests
from bs4 import BeautifulSoup
url = 'https://www.buywholefoodsonline.co.uk/organic-spanish-bee-pollen-250g.html'
# create `soup` variable from the URL:
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# print text of first `<h1>` tag:
print(soup.h1.get_text())
Prints:
Organic Spanish Bee Pollen 250g
Or you can keep the itemprop filter from the question. Note that the original snippet called find() on the BeautifulSoup class itself (BeautifulSoup.find(...)) instead of on the soup instance, which is why it did not return the tag:
print(soup.find('h1', {'itemprop': 'name'}).get_text())

Results of soup.find are None despite the content existing

I'm trying to track the price of a product on Amazon using Python in a Jupyter notebook. I've imported bs4 and requests for this task.
When I inspect the HTML on the product page I can see <span id="productTitle" class="a-size-large">.
However, when I try to search for it using soup.find(id="productTitle"), the result comes out as None.
I've tried soup.find with other ids and classes, but the results are still None.
This is my code to find the id:
title = soup.find(id="productTitle")
If I fix this, I hope to be able to get the name of the product whose price I will be tracking.
That info is stored in various places in the returned html. Have you checked your response to make sure you are not blocked or getting an unexpected response?
I found it with that id, using the lxml parser and stripping whitespace:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.amazon.com/dp/B00M4LWO8O/')
soup = bs(r.content, 'lxml')
print(soup.select_one('#productTitle').text.strip())
Also,
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.amazon.com/dp/B00M4LWO8O/')
soup = bs(r.content, 'lxml')
print(soup.select_one('#imgTagWrapperId img[alt]')['alt'])
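If either snippet prints nothing or select_one returns None, the point above about being blocked applies; a minimal sketch of checking the response first (the User-Agent string is an illustrative assumption, not something the answer specifies):
import requests

# hypothetical desktop User-Agent; Amazon often serves a robot-check page to bare clients
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.amazon.com/dp/B00M4LWO8O/', headers=headers)
print(r.status_code)                # a non-200 status suggests the request was rejected
print('captcha' in r.text.lower())  # True usually means a robot-check page came back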

Beautiful Soup Error: Trying to retrieve data from web page returns empty array

I am trying to download a list of voting intention opinion polls from this web page using Beautiful Soup. However, the code I wrote returns an empty array or nothing.
The page code is like this:
<div class="ST-c2-dv1 ST-ch ST-PS" style="width:33px"></div>
<div class="ST-c2-dv2">41.8</div>
That's what I tried:
import requests
from bs4 import BeautifulSoup
request = requests.get(quote_page)  # quote_page holds the page link
page = request.content  # extract page content
soup = BeautifulSoup(page, "html.parser")
# extract all the divs
for each_div in soup.findAll('div', {'class': 'ST-c2-dv2'}):
    print(each_div)
At this point, it prints nothing.
I've also tried this:
tutti_a = soup.find_all("html_element", class_="ST-c2-dv2")
and also:
tutti_a = soup.find_all("div", class_="ST-c2-dv2")
But I get an empty array [] or nothing at all
I think you can use the following URL:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.marktest.com/wap/a/sf/v~[73D5799E1B0E]/name~Dossier_5fSondagensLegislativas_5f2011.HighCharts.Sondagens.xml.aspx')
soup = bs(r.content, 'lxml')
results = []
for record in soup.select('p'):
    results.append([item.text for item in record.select('b')])
df = pd.DataFrame(results)
print(df)
Columns 5, 6, 7, 8, 9, 10 correspond to PS, PSD, CDS, CDU, Bloco, Outros/Brancos/Nulos.
You can drop unwanted columns, add appropriate headers, etc.
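A minimal sketch of that cleanup, assuming the column positions in the mapping above:
# keep only columns 5-10 and label them with the party names listed above
df = df.iloc[:, 5:11]
df.columns = ['PS', 'PSD', 'CDS', 'CDU', 'Bloco', 'Outros/Brancos/Nulos']
print(df.head())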
