Extract max site number from web page - python-3.x

I need to extract the maximum page number from a property website.
My code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.nehnutelnosti.sk/vyhladavanie/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id = 'inzeraty')
sitenums = soup.find_all('ul', class_='component-pagination__items d-flex align-items-center')
sitenums.find_all('li', class_='component-pagination__item')
My code returns an error:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Thanks for any help.

You could use a CSS selector and grab the second value from the end.
Here's how:
import requests
from bs4 import BeautifulSoup
css = ".component-pagination .component-pagination__item a, .component-pagination .component-pagination__item span"
page = requests.get('https://www.nehnutelnosti.sk/vyhladavanie/')
soup = BeautifulSoup(page.content, 'html.parser').select(css)[-2]
print(soup.getText(strip=True))
Output:
2309
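The AttributeError in the question arises because find_all() returns a ResultSet (a list of tags), which has no find_all() of its own. Below is a minimal sketch of the fix, run against a small hypothetical HTML fragment that only imitates the class names from the question; the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names in the question.
html = """
<ul class="component-pagination__items d-flex align-items-center">
  <li class="component-pagination__item"><span>1</span></li>
  <li class="component-pagination__item"><a href="#">2</a></li>
  <li class="component-pagination__item"><a href="#">2309</a></li>
  <li class="component-pagination__item"><a href="#">Next</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag, so find_all() can be chained on it.
pagination = soup.find("ul", class_="component-pagination__items")
items = pagination.find_all("li", class_="component-pagination__item")

# Assumption: the last item is the "next" link, so the second-to-last
# item holds the highest page number.
max_page = int(items[-2].get_text(strip=True))
print(max_page)
```

If the pagination block can be missing, checking that find() did not return None before chaining would be a sensible addition.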

A similar idea, but filtering inside the CSS selector with :nth-last-child instead of indexing the result list, which should be faster.
The :nth-last-child() CSS pseudo-class matches elements based on their position among a group of siblings, counting from the end.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.nehnutelnosti.sk/vyhladavanie')
soup = bs(r.text, "lxml")
print(int(soup.select_one('.component-pagination__item:nth-last-child(2) a').text.strip()))
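To try the selector without hitting the live site, here is a small sketch on hypothetical markup that assumes every pagination entry is an li with the class used above and that the last entry is a "next" link. select_one() returns None when nothing matches, so the sketch guards against that.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may be structured differently.
html = """<ul class="component-pagination__items">
<li class="component-pagination__item"><a href="#">1</a></li>
<li class="component-pagination__item"><a href="#">2</a></li>
<li class="component-pagination__item"><a href="#">2309</a></li>
<li class="component-pagination__item"><a href="#">Next</a></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

# :nth-last-child(2) picks the second element sibling counting from the end.
link = soup.select_one(".component-pagination__item:nth-last-child(2) a")

# Guard against a missing match instead of calling .text on None.
last_page = int(link.text.strip()) if link is not None else None
print(last_page)
```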

Related

How to filter on this artifact in the HTML?

I am using this code and it works:
from bs4 import BeautifulSoup
import sys
import requests
page = requests.get("https://finance.yahoo.com/quote/GOOGL?p=GOOGL")
soup = BeautifulSoup(page.content, 'html.parser')
# A hyphen is not allowed in a Python identifier, so use an underscore.
fin_streamer = soup.find("fin-streamer", class_="Fz(36px)")
print(fin_streamer)
print(fin_streamer.get_text())
It prints this for fin_streamer:
<fin-streamer active="" class="Fw(b) Fz(36px) Mb(-4px) D(ib)" data-field="regularMarketPrice" data-pricehint="2" data-reactid="47" data-symbol="GOOGL" data-test="qsp-price" data-trend="none" value="2897.04">2,897.04</fin-streamer>
What I'd like to do is filter on something more useful than the Fz(36px) class, such as
data-symbol="GOOGL"
but I don't know the syntax for that.
How to select?
To select your element more specifically, simply make use of CSS selectors:
soup.select_one('fin-streamer[data-symbol="GOOGL"]')
The line above selects the first <fin-streamer> whose data-symbol attribute is "GOOGL". To get its value, index it with ['value']; alternatively, read its .text attribute.
Note: value and text differ in format. For the element shown in the question, value is 2897.04 while the text is 2,897.04.
Example
from bs4 import BeautifulSoup
import requests
page = requests.get("https://finance.yahoo.com/quote/GOOGL?p=GOOGL")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.select_one('fin-streamer[data-symbol="GOOGL"]')['value'])
You can use a dictionary:
fin_streamer = soup.find("fin-streamer", {"data-symbol":"GOOGL"})
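Both approaches can be compared offline on the element printed in the question. This is only a sketch: the HTML string below is an abbreviated copy of that output, and the live page may no longer look like this.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the element shown in the question.
html = ('<fin-streamer active="" class="Fw(b) Fz(36px) Mb(-4px) D(ib)" '
        'data-field="regularMarketPrice" data-symbol="GOOGL" '
        'value="2897.04">2,897.04</fin-streamer>')
soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector versus an attribute dictionary passed to find().
by_selector = soup.select_one('fin-streamer[data-symbol="GOOGL"]')
by_dict = soup.find("fin-streamer", {"data-symbol": "GOOGL"})

raw_value = by_selector["value"]        # attribute: no thousands separator
display_text = by_selector.get_text()   # text node: formatted for display
price = float(raw_value)                # the attribute converts cleanly

print(raw_value, display_text, price)
```

The value attribute is the easier one to convert to a number, since the text contains a thousands separator.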

Use beautifulsoup to download href links

I am looking to download href links using beautifulsoup4, Python 3 and the requests library.
This is the code that I have now. I thought it would be tough to use regex in this situation, but I'm not sure whether this can be done using Beautiful Soup. I have to download all of the shape files from the grid and am looking to automate this task. Thank you!
URL:
https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads
import requests
from bs4 import BeautifulSoup
import re
URL = 'https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads'
page = requests.get(URL)
soup = BeautifulSoup(page.content,'html.parser')
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
print(results)
Those files are all associated with area tags, so I would simply select those:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://earth-info.nga.mil/index.php?dir=coordsys&action=gars-20x20-dloads')
soup = bs(r.content, 'lxml')
files = ['https://earth-info.nga.mil/' + i['href'] for i in soup.select('area')]
re.findall expects a string, but page is a Response object. Pass its text instead in order to search for all the a tags using regex.
Instead of:
results = re.findall(r'<a[^>]* href="([^"]*)"', page)
Use:
results = re.findall(r'<a[^>]* href="([^"]*)"', page.text)
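As a variation on the area-tag approach, the sketch below builds absolute URLs with urllib.parse.urljoin rather than string concatenation and skips area tags that have no href. The HTML fragment and the file names in it are placeholders, not the real page content.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://earth-info.nga.mil/"

# Hypothetical fragment; the href values are placeholders.
html = """
<map name="grid">
  <area shape="rect" coords="0,0,10,10" href="path/to/file_a.zip">
  <area shape="rect" coords="10,0,20,10" href="path/to/file_b.zip">
  <area shape="rect" coords="20,0,30,10">
</map>
<a href="index.php">Home</a>
"""
soup = BeautifulSoup(html, "html.parser")

# area[href] keeps only area tags that actually carry an href attribute.
files = [urljoin(BASE, area["href"]) for area in soup.select("area[href]")]
print(files)
```

Each resulting URL could then be fetched with requests.get() and its content written to disk.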

Can't scrape a value with BeautifulSoup in Python

I am creating a website where I display the current wind. When I go to https://www.windguru.cz/station/219 (and click on inspect element at the max:{wind}) I can see this:
<span class="wgs_wind_max_value">12</span>
The 12 is the value I need, but when I try to scrape it with bs4 and requests, this appears as output:
<span class="wgs_wind_max_value"></span>
As you can see, there is no '12' value.
Can someone help me with that?
from bs4 import BeautifulSoup
import requests
page = requests.get('https://www.windguru.cz/3323')
soup = BeautifulSoup(page.content, "lxml")
table = soup.find_all("span", {"class": "wgs_wind_max_value"})
print(table)
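The value is most likely filled in by JavaScript after the page loads, so it is missing from the HTML that requests receives. Use the same API the page itself calls to get the JSON that populates those values. Note how the query string passed to the API is constructed, and that a Referer header is sent.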
import requests
headers = {'Referer' : 'https://www.windguru.cz/station/219'}
r = requests.get('https://www.windguru.cz/int/iapi.php?q=station_data_current&id_station=219&date_format=Y-m-d%20H%3Ai%3As%20T&_mha=f4d18b6c', headers = headers).json()
print(r)
print(r['wind_max'])
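To make the query string easier to read and change, it can be assembled from a dictionary. The sketch below uses only the standard library and reproduces the URL from the answer; the parameter values are copied from that URL, and what _mha means is not documented here, so treat it as an opaque value.

```python
from urllib.parse import quote, urlencode

BASE = "https://www.windguru.cz/int/iapi.php"

params = {
    "q": "station_data_current",
    "id_station": 219,
    "date_format": "Y-m-d H:i:s T",
    "_mha": "f4d18b6c",  # copied verbatim from the answer; meaning unknown
}

# quote_via=quote encodes spaces as %20 (the default would use '+').
query = urlencode(params, quote_via=quote)
url = f"{BASE}?{query}"
print(url)
```

With requests, the same dictionary could instead be passed through the params argument of requests.get(), which performs the encoding itself.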

Beautiful Soup Error: Trying to retrieve data from web page returns empty array

I am trying to download a list of voting intention opinion polls from this web page using Beautiful Soup. However, the code I wrote returns an empty array or nothing at all.
The page code is like this:
<div class="ST-c2-dv1 ST-ch ST-PS" style="width:33px"></div>
<div class="ST-c2-dv2">41.8</div>
That's what I tried:
import requests
from bs4 import BeautifulSoup
request = requests.get(quote_page) # take the page link
page = request.content # extract page content
soup = BeautifulSoup(page, "html.parser")
# extract all the divs
for each_div in soup.findAll('div', {'class': 'ST-c2-dv2'}):
    print(each_div)
At this point, it prints nothing.
I've tried also this:
tutti_a = soup.find_all("html_element", class_="ST-c2-dv2")
and also:
tutti_a = soup.find_all("div", class_="ST-c2-dv2")
But I get an empty array [] or nothing at all.
I think you can use the following URL:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.marktest.com/wap/a/sf/v~[73D5799E1B0E]/name~Dossier_5fSondagensLegislativas_5f2011.HighCharts.Sondagens.xml.aspx')
soup = bs(r.content, 'lxml')
results = []
for record in soup.select('p'):
    results.append([item.text for item in record.select('b')])
df = pd.DataFrame(results)
print(df)
Columns 5, 6, 7, 8, 9 and 10 correspond to PS, PSD, CDS, CDU, Bloco and Outros/Brancos/Nulos.
You can drop unwanted columns, add appropriate headers etc.
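Dropping columns and adding headers might look like the sketch below. The rows are placeholder data shaped like the scraped result (strings, with the party figures in columns 5 to 10); they are not real poll numbers, and the meaning of the first five columns is not assumed.

```python
import pandas as pd

# Placeholder rows standing in for the scraped records (all strings).
results = [
    ["x"] * 5 + ["1.0", "2.0", "3.0", "4.0", "5.0", "6.0"],
    ["y"] * 5 + ["1.5", "2.5", "3.5", "4.5", "5.5", "6.5"],
]
df = pd.DataFrame(results)

parties = ["PS", "PSD", "CDS", "CDU", "Bloco", "Outros/Brancos/Nulos"]

# Keep only columns 5..10, label them, and convert the text to numbers.
polls = df.iloc[:, 5:11].copy()
polls.columns = parties
polls = polls.astype(float)
print(polls)
```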

BeautifulSoup find attribute value in any tag

How to find value of certain attribute using bs4? For example, I need to find all values of src attribute, it could be in any tag of my html document.
You can do something like this:
from bs4 import BeautifulSoup
import requests
r = requests.get('http://your.url')
soup = BeautifulSoup(r.text,'html.parser')
attr_src = []
for tag in soup():
    if 'src' in tag.attrs:
        attr_src.append(tag.get('src'))
print(attr_src)
Just use an attribute selector (that's what it's intended for). It is also more efficient.
values = [item['src'] for item in soup.select('[src]')]
You can narrow the match by adding = and the required value after the attribute name, e.g. [src="mystring"]. CSS also provides operators for substring matches, such as [src*="mystring"].
Example:
import requests
from bs4 import BeautifulSoup as bs
res = requests.get('https://stackoverflow.com/questions/55060825/beautifulsoup-find-attribute-value-in-any-tag/55062258#55062258')
soup = bs(res.content, 'lxml')
values = [item['src'] for item in soup.select('[src]')]
print(values)
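The selector variants can be compared on a small inline document. This is a sketch with made-up tags and URLs, chosen only to show which elements each selector picks up.

```python
from bs4 import BeautifulSoup

# Made-up document; the URLs are illustrative only.
html = """
<img src="https://cdn.example.com/logo.png">
<script src="/static/app.js"></script>
<iframe src="https://example.com/embed"></iframe>
<a href="/no-src-here">link</a>
"""
soup = BeautifulSoup(html, "html.parser")

# [src] matches any tag that has a src attribute, whatever the tag name.
all_src = [tag["src"] for tag in soup.select("[src]")]

# = needs the exact value, *= a substring, $= a suffix.
exact = [tag.name for tag in soup.select('[src="/static/app.js"]')]
containing = [tag["src"] for tag in soup.select('[src*="example.com"]')]
png = [tag["src"] for tag in soup.select('[src$=".png"]')]

print(all_src, exact, containing, png)
```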
