Scraping with Python 3

I'm new to scraping and to train I'm trying to get all the functions from this page:
https://www.w3schools.com/python/python_ref_functions.asp
from bs4 import BeautifulSoup
import requests
url = "https://www.w3schools.com/python/python_ref_functions.asp"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
print(soup.td.text)
# Output: abs()
no matter what I try, I only get the 1st one: abs()
Can you help me get them all from abs() to zip()?

To get all matching tags from a page, use find_all(); it returns a list of items. To get a single tag, use find(); it returns one item.
The trick is to find the parent tag of all the elements you need, then drill down with whichever method is most convenient. You can find more in the documentation.
from bs4 import BeautifulSoup
import requests
url = "https://www.w3schools.com/python/python_ref_functions.asp"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
#scrape table which contains all functions
tabledata = soup.find("table", attrs={"class": "w3-table-all notranslate"})
#print(tabledata)
#from table data get all a tags of functions
functions = tabledata.find_all("a")
#find_all() method returns list of elements iterate over it
for func in functions:
    print(func.contents)

You can use find_all to iterate over every tag that matches:
for tag in soup.find_all('td'):
    print(tag.text)
This will include the Description column though, so you'll need to adjust it to ignore those cells.
soup.td will only return the first matching tag.
So one solution would be:
for tag in soup.find_all('tr'):
    cell = tag.td
    if cell:
        print(cell.text)
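If it helps, here's a CSS-selector variant of the same idea that skips the Description column. The inline HTML is a minimal stand-in for the W3Schools table (the assumed structure, with the function name in the first td of each row), since the live page may change:

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the W3Schools functions table (assumed structure).
html = """
<table class="w3-table-all notranslate">
  <tr><th>Function</th><th>Description</th></tr>
  <tr><td><a href="#">abs()</a></td><td>Returns absolute value</td></tr>
  <tr><td><a href="#">zip()</a></td><td>Returns an iterator</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# 'td:first-child' keeps only the first cell of each row, so the
# Description column (and the th header row) is skipped entirely.
names = [td.get_text(strip=True) for td in soup.select("td:first-child")]
print(names)  # ['abs()', 'zip()']
```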

Related

Extract max site number from web page

I need to extract the max page number from a property-listings web site.
My code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.nehnutelnosti.sk/vyhladavanie/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id = 'inzeraty')
sitenums = soup.find_all('ul', class_='component-pagination__items d-flex align-items-center')
sitenums.find_all('li', class_='component-pagination__item')
My code returns error:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Thanks for any help.
You could use a CSS selector and grab the second value from the end.
Here's how:
import requests
from bs4 import BeautifulSoup
css = ".component-pagination .component-pagination__item a, .component-pagination .component-pagination__item span"
page = requests.get('https://www.nehnutelnosti.sk/vyhladavanie/')
soup = BeautifulSoup(page.content, 'html.parser').select(css)[-2]
print(soup.getText(strip=True))
Output:
2309
A similar idea, but doing the filtering within the CSS selector with :nth-last-child() rather than indexing in Python:
The :nth-last-child() CSS pseudo-class matches elements based on their
position among a group of siblings, counting from the end.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.nehnutelnosti.sk/vyhladavanie')
soup = bs(r.text, "lxml")
print(int(soup.select_one('.component-pagination__item:nth-last-child(2) a').text.strip()))
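To illustrate why "second from the end" works, here is the same selector run against a minimal stand-in for the pagination markup (the shape is assumed: the last item is the "next" arrow, so the max page number sits second from the end):

```python
from bs4 import BeautifulSoup

# Stand-in pagination markup (assumed shape of the real page).
html = """
<ul class="component-pagination__items">
  <li class="component-pagination__item"><a>1</a></li>
  <li class="component-pagination__item"><span>...</span></li>
  <li class="component-pagination__item"><a>2309</a></li>
  <li class="component-pagination__item"><a>&gt;</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# :nth-last-child(2) matches the second <li> counting from the end,
# which is the one holding the max page number.
max_page = int(soup.select_one(
    ".component-pagination__item:nth-last-child(2) a").text.strip())
print(max_page)  # 2309
```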

Beautiful Soup Nested Loops

I was hoping to create a list of all of the firms featured on this list. I expected each winner to have their own section in the HTML, but it looks like several are grouped together across multiple divs. How would you recommend solving this? I was able to pull all of the divs, but I don't know how to cycle through them appropriately. Thanks!
import requests
from bs4 import BeautifulSoup
import csv
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
element = soup.find()
person = soup.find_all('div', class_="under40")
This solution uses CSS selectors:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
# if you have an older version you'll need to use contains instead of -soup-contains
firm_tags = soup.select('h5:-soup-contains("Firm") strong')
# extract the text from the selected bs4.Tags
firms = [tag.text for tag in firm_tags]
# if there is extra whitespace
clean_firms = [f.strip() for f in firms]
It works by selecting all the strong tags whose parent h5 tag contains the word "Firm".
See the SoupSieve docs for more info on bs4's CSS selectors.
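If your installed soupsieve is too old for :-soup-contains, a plain-bs4 fallback is to filter the h5 tags by their text yourself. A sketch against stand-in markup (the h5/strong structure is assumed from the answer above):

```python
from bs4 import BeautifulSoup

# Stand-in markup mirroring the assumed h5/strong page structure.
html = """
<h5>Firm: <strong>Acme Capital</strong></h5>
<h5>Title: <strong>Partner</strong></h5>
<h5>Firm: <strong>Example Partners </strong></h5>
"""
soup = BeautifulSoup(html, "html.parser")

# Plain-bs4 fallback for :-soup-contains: keep only <h5> tags whose text
# mentions "Firm", then collect their <strong> children.
firms = [strong.get_text(strip=True)
         for h5 in soup.find_all("h5")
         if "Firm" in h5.get_text()
         for strong in h5.find_all("strong")]
print(firms)  # ['Acme Capital', 'Example Partners']
```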

Python 3 BeautifulSoup with regex returns None

I'm having trouble getting BeautifulSoup with regex to work. I have tested the regex and it seems to work but BeautifulSoup still returns None.
Example of code I want to find
body class="page-template-default page page-id-1864
My code:
element = soup.find(text=re.compile(r"((body class).*.(page-id-\d+))"))
I have also tried with just the below and it still returns None
element = soup.find(text=re.compile(r"(body class)"))
I can confirm that the section is part of the response.content
You can try this:
from bs4 import BeautifulSoup
data = """
<body class="home page-template-default page page-id-10 original wpb-js-composer js-comp-ver-6.4.1 vc_responsive" data-footer-reveal="false" data-footer-reveal-shadow="none" data-header-format="default" data-body-border="off" data-boxed-style="" data-header-breakpoint="1000" data-dropdown-style="minimal" data-cae="linear" data-cad="650" data-megamenu-width="contained" data-aie="none" data-ls="none" data-apte="standard" data-hhun="0" data-fancy-form-rcs="default" data-form-style="default" data-form-submit="default" data-is="minimal" data-button-style="default" data-user-account-button="false" data-flex-cols="true" data-col-gap="default" data-header-inherit-rc="false" data-header-search="false" data-animated-anchors="false" data-ajax-transitions="false" data-full-width-header="false" data-slide-out-widget-area="true" data-slide-out-widget-area-style="slide-out-from-right" data-user-set-ocm="off" data-loading-animation="none" data-bg-header="false" data-responsive="1" data-ext-responsive="true" data-header-resize="1" data-header-color="custom" data-transparent-header="false" data-cart="false" data-remove-m-parallax="" data-remove-m-video-bgs="" data-m-animate="0" data-force-header-trans-color="light" data-smooth-scrolling="0" data-permanent-transparent="false" cz-shortcut-listen="true">...</body>
"""
soup = BeautifulSoup(data, "html.parser")
Find body and get its class, slice to get first 4 classes:
classText = soup.find('body').attrs['class'][:4]
Join the list into a single string and slice off the trailing characters:
' '.join(map(str, classText))[:-2]
Output:
'home page-template-default page page-id-'
Not bulletproof, but workable given how little context there is.
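On the original regex attempt: find(text=...) searches text nodes, not tag attributes, which is why it returned None even though the markup was in the response. A sketch that matches on the class attribute instead, using a trimmed copy of the markup:

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the body tag from the question.
html = '<body class="home page-template-default page page-id-1864 vc_responsive">...</body>'
soup = BeautifulSoup(html, "html.parser")

# class_ accepts a compiled pattern, matched against the class values,
# so the attribute itself (not the text) is searched.
body = soup.find("body", class_=re.compile(r"page-id-\d+"))
page_id = re.search(r"page-id-(\d+)", " ".join(body["class"])).group(1)
print(page_id)  # 1864
```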

Beautiful Soup Error: Trying to retrieve data from web page returns empty array

I am trying to download a list of voting intention opinion polls from this web page using beautiful soup. However, the code I wrote returns an empty array or nothing. The code I used is below:
The page code is like this:
<div class="ST-c2-dv1 ST-ch ST-PS" style="width:33px"></div>
<div class="ST-c2-dv2">41.8</div>
That's what I tried:
import requests
from bs4 import BeautifulSoup
request = requests.get(quote_page) # take the page link
page = request.content # extract page content
soup = BeautifulSoup(page, "html.parser")
# extract all the divs
for each_div in soup.findAll('div', {'class': 'ST-c2-dv2'}):
    print(each_div)
At this point, it prints nothing.
I've tried also this:
tutti_a = soup.find_all("html_element", class_="ST-c2-dv2")
and also:
tutti_a = soup.find_all("div", class_="ST-c2-dv2")
But I get an empty array [] or nothing at all
I think you can use the following URL:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.marktest.com/wap/a/sf/v~[73D5799E1B0E]/name~Dossier_5fSondagensLegislativas_5f2011.HighCharts.Sondagens.xml.aspx')
soup = bs(r.content, 'lxml')
results = []
for record in soup.select('p'):
    results.append([item.text for item in record.select('b')])
df = pd.DataFrame(results)
print(df)
Columns 5, 6, 7, 8, 9, 10 correspond to PS, PSD, CDS, CDU, Bloco, and Outros/Brancos/Nulos.
You can drop unwanted columns, add appropriate headers etc.

BeautifulSoup find attribute value in any tag

How can I find the value of a certain attribute using bs4? For example, I need to find all values of the src attribute; it could be in any tag of my HTML document.
You can do something like this:
from bs4 import BeautifulSoup
import requests
r = requests.get('http://your.url')
soup = BeautifulSoup(r.text,'html.parser')
attr_src = []
for tag in soup():
    if 'src' in tag.attrs:
        attr_src.append(tag.get('src'))
print(attr_src)
Just use an attribute selector; that's what it's intended for, and it's more efficient.
values = [item['src'] for item in soup.select('[src]')]
You can extend this by adding the required string/substring of the desired value after the attribute, i.e. [src="mystring"].
Example:
import requests
from bs4 import BeautifulSoup as bs
res = requests.get('https://stackoverflow.com/questions/55060825/beautifulsoup-find-attribute-value-in-any-tag/55062258#55062258')
soup = bs(res.content, 'lxml')
values = [item['src'] for item in soup.select('[src]')]
print(values)
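soupsieve also supports prefix (^=), suffix ($=) and substring (*=) attribute selectors. A self-contained sketch against stand-in markup:

```python
from bs4 import BeautifulSoup

# Stand-in markup with src attributes on several different tag types.
html = """
<img src="/static/logo.png">
<script src="https://cdn.example.com/app.js"></script>
<iframe src="/embed/video"></iframe>
"""
soup = BeautifulSoup(html, "html.parser")

# [src] matches any tag carrying the attribute; [src$=".png"] narrows
# the match to values ending in ".png".
all_src = [t["src"] for t in soup.select("[src]")]
png_only = [t["src"] for t in soup.select('[src$=".png"]')]
print(all_src)   # ['/static/logo.png', 'https://cdn.example.com/app.js', '/embed/video']
print(png_only)  # ['/static/logo.png']
```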
