Beautiful Soup Nested Loops - python-3.x

I want to build a list of all of the firms featured on this list. I was hoping each winner would have their own section in the HTML, but it looks like several winners are grouped together across multiple divs. How would you recommend solving this? I was able to pull all of the divs, but I don't know how to loop through them properly. Thanks!
import requests
from bs4 import BeautifulSoup
import csv
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
element = soup.find()
person = soup.find_all('div', class_="under40")

This solution uses CSS selectors:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
# if you have an older version, you'll need to use :contains instead of :-soup-contains
firm_tags = soup.select('h5:-soup-contains("Firm") strong')
# extract the text from the selected bs4.Tags
firms = [tag.text for tag in firm_tags]
# if there is extra whitespace
clean_firms = [f.strip() for f in firms]
It works by selecting every strong tag inside an h5 tag whose text contains the word "Firm".
See the SoupSieve docs for more info on bs4's CSS selectors.
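To see the selector in isolation, here is a minimal sketch that runs on a small inline snippet instead of the live page. The markup and firm names below are made up for illustration (the real page's structure may differ), and :-soup-contains assumes a reasonably recent soupsieve.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: several h5 lines grouped inside each div
html = """
<div class="under40">
  <h5>Firm: <strong> Acme Capital </strong></h5>
  <h5>Title: <strong>Partner</strong></h5>
</div>
<div class="under40">
  <h5>Firm: <strong>Beta Partners</strong></h5>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Only strong tags under an h5 whose text contains "Firm" are selected,
# so the "Title" line is skipped without any nested loop over the divs.
firm_tags = soup.select('h5:-soup-contains("Firm") strong')
firms = [tag.get_text(strip=True) for tag in firm_tags]
print(firms)
```

Because the selector searches the whole document, it does not matter how many winners share one div.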

Related

How to scrape href using bs4 and Python

I am trying to scrape the "href" links from the page, but the result is None. Can you please help me find where my code is going wrong? Why is the code returning None?
import requests
from bs4 import BeautifulSoup
import pprint
res = requests.get('https://news.ycombinator.com/newest')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.select('.titleline')
def fit_hn(links):
    hn = []
    for idx, item in enumerate(links):
        href = links[idx].get('href')
        hn.append(href)
    return hn

pprint.pprint(fit_hn(links))
Let's take a deeper look.
If you print links, you'll see that each item is the span that wraps the <a>, not the <a> itself:
<span class="titleline">The Strangely Beautiful Experience of Google Reviews<span class="sitebit comhead"> (<span class="sitestr">longreads.com</span>)</span></span>
So, really you need to go one level deeper and select the <a> tag.
Change your CSS selector to also select the a tags:
links = soup.select('.titleline a')
Your above code now prints:
['https://www.the-sun.com/tech/7078358/xoxe-ai-woman-detects-anxiety-and-crime-afterlife/',
'from?site=the-sun.com',
'https://www.economist.com/business/2023/01/05/how-to-avoid-flight-chaos',
'from?site=economist.com',
'https://www.thegutterreview.com/but-who-is-the-artist-the-kenny-who-trilogy-and-the-reality-of-ai-art/',
'from?site=thegutterreview.com',
'https://twitter.com/jburnmurdoch/status/1606223967903260673',
'from?site=twitter.com/jburnmurdoch',
'https://arstechnica.com/gadgets/2023/01/newest-raspberry-pi-camera-module-3-adds-autofocus-wide-view-hdr/',
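The difference between the two selectors can be reproduced offline. This is a small sketch on hypothetical markup modelled on the span shown above (the URLs are placeholders): the span itself has no href, while the a tags inside it do. It also shows a child combinator, which is one possible way to skip the 'from?site=...' links if you only want the story links.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a story link plus a nested "site" link
html = """
<span class="titleline"><a href="https://example.com/story">A story</a>
<span class="sitebit comhead"> (<a href="from?site=example.com"><span class="sitestr">example.com</span></a>)</span></span>
"""

soup = BeautifulSoup(html, "html.parser")

# The span carries no href attribute, so .get('href') returns None
span_hrefs = [tag.get("href") for tag in soup.select(".titleline")]

# Descendant selector: every <a> anywhere inside the span
all_hrefs = [a.get("href") for a in soup.select(".titleline a")]

# Child selector: only <a> tags that are direct children of the span
story_hrefs = [a.get("href") for a in soup.select(".titleline > a")]

print(span_hrefs)
print(all_hrefs)
print(story_hrefs)
```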

Extract max site number from web page

I need to extract the maximum page number from a property website.
My code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.nehnutelnosti.sk/vyhladavanie/'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id = 'inzeraty')
sitenums = soup.find_all('ul', class_='component-pagination__items d-flex align-items-center')
sitenums.find_all('li', class_='component-pagination__item')
My code returns error:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Thanks for any help.
You could use a CSS selector and grab the second-to-last value.
Here's how:
import requests
from bs4 import BeautifulSoup
css = ".component-pagination .component-pagination__item a, .component-pagination .component-pagination__item span"
page = requests.get('https://www.nehnutelnosti.sk/vyhladavanie/')
soup = BeautifulSoup(page.content, 'html.parser').select(css)[-2]
print(soup.getText(strip=True))
Output:
2309
A similar idea, but the filtering is done inside the CSS selector with :nth-last-child rather than by indexing the result list.
The :nth-last-child() CSS pseudo-class matches elements based on their
position among a group of siblings, counting from the end.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.nehnutelnosti.sk/vyhladavanie')
soup = bs(r.text, "lxml")
print(int(soup.select_one('.component-pagination__item:nth-last-child(2) a').text.strip()))
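For reference, here is a self-contained sketch of both approaches and of the original error, run against a made-up pagination list. The class names are borrowed from the question, but the markup and page numbers are assumptions, not the site's actual HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup; the last item is a "Next" link
html = """
<ul class="component-pagination__items d-flex align-items-center">
  <li class="component-pagination__item"><a href="?p=1">1</a></li>
  <li class="component-pagination__item"><a href="?p=2">2</a></li>
  <li class="component-pagination__item"><span>...</span></li>
  <li class="component-pagination__item"><a href="?p=2309">2309</a></li>
  <li class="component-pagination__item"><a href="?p=2">Next</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet (a list), which has no find_all() of its own
lists = soup.find_all("ul", class_="component-pagination__items")
try:
    lists.find_all("li")
    raised = False
except AttributeError:
    raised = True

# find() returns a single Tag, which can be searched further
pagination = soup.find("ul", class_="component-pagination__items")
items = pagination.find_all("li", class_="component-pagination__item")
last_page_by_index = int(items[-2].get_text(strip=True))

# Same result with the filtering done inside the selector
link = soup.select_one(".component-pagination__item:nth-last-child(2) a")
last_page_by_selector = int(link.get_text(strip=True))

print(raised, last_page_by_index, last_page_by_selector)
```

Both approaches rely on the last page number sitting second from the end, so they would need adjusting if the site changes its pagination layout.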

Beautiful Soup Error: Trying to retrieve data from web page returns empty array

I am trying to download a list of voting intention opinion polls from this web page using Beautiful Soup. However, the code I wrote returns an empty array or nothing at all.
The page's HTML looks like this:
<div class="ST-c2-dv1 ST-ch ST-PS" style="width:33px"></div>
<div class="ST-c2-dv2">41.8</div>
That's what I tried:
import requests
from bs4 import BeautifulSoup
request = requests.get(quote_page) # take the page link
page = request.content # extract page content
soup = BeautifulSoup(page, "html.parser")
# extract all the divs
for each_div in soup.find_all('div', {'class': 'ST-c2-dv2'}):
    print(each_div)
At this point, it prints nothing.
I've tried also this:
tutti_a = soup.find_all("html_element", class_="ST-c2-dv2")
and also:
tutti_a = soup.find_all("div", class_="ST-c2-dv2")
But I get an empty array [] or nothing at all.
I think you can use the following URL instead:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
r = requests.get('https://www.marktest.com/wap/a/sf/v~[73D5799E1B0E]/name~Dossier_5fSondagensLegislativas_5f2011.HighCharts.Sondagens.xml.aspx')
soup = bs(r.content, 'lxml')
results = []
for record in soup.select('p'):
    results.append([item.text for item in record.select('b')])
df = pd.DataFrame(results)
print(df)
Columns 5, 6, 7, 8, 9 and 10 correspond to PS, PSD, CDS, CDU, Bloco and Outros/Brancos/Nulos.
You can drop unwanted columns, add appropriate headers, etc.
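As a rough sketch of that last step, the snippet below keeps only columns 5 to 10 and gives them the party names listed above. The rows are placeholder values invented for illustration, not real poll figures, and the contents of columns 0 to 4 are not known here.

```python
import pandas as pd

# Placeholder rows shaped like the scraped results (11 fields per record)
results = [
    ["x0", "x1", "x2", "x3", "x4", "41.8", "30.1", "10.2", "8.0", "5.5", "4.4"],
    ["y0", "y1", "y2", "y3", "y4", "39.0", "33.5", "9.8", "7.7", "6.0", "4.0"],
]
df = pd.DataFrame(results)

# Map the positional column labels to the party names from the answer
party_columns = {
    5: "PS",
    6: "PSD",
    7: "CDS",
    8: "CDU",
    9: "Bloco",
    10: "Outros/Brancos/Nulos",
}

# Keep only the party columns, rename them, and convert the text to numbers
polls = df[list(party_columns)].rename(columns=party_columns).astype(float)
print(polls)
```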

Scraping with Python 3

Python3:
I'm new to scraping and, for practice, I'm trying to get all the functions from this page:
https://www.w3schools.com/python/python_ref_functions.asp
from bs4 import BeautifulSoup
import requests
url = "https://www.w3schools.com/python/python_ref_functions.asp"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
print(soup.td.text)
# Output: abs()
No matter what I try, I only get the first one: abs()
Can you help me get them all from abs() to zip()?
To get all matching tags from a web page, use find_all(); it returns a list of items.
To get a single tag, use find(); it returns one item.
The trick is to find the parent tag that contains all the elements you need, then use whichever methods you find most convenient.
from bs4 import BeautifulSoup
import requests
url = "https://www.w3schools.com/python/python_ref_functions.asp"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
# scrape the table which contains all functions
tabledata = soup.find("table", attrs={"class": "w3-table-all notranslate"})
# print(tabledata)
# from the table data, get all a tags of the functions
functions = tabledata.find_all("a")
# find_all() returns a list of elements; iterate over it
for func in functions:
    print(func.contents)
You can use find_all to iterate through all the tags that match:
for tag in soup.find_all('td'):
    print(tag.text)
This will include the Description column though, so you'll need to change it to ignore those cells.
soup.td only returns the first matching tag, and the same is true of tag.td on a row.
So one solution would be:
for tag in soup.find_all('tr'):
    cell = tag.td
    if cell:
        print(cell.text)
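Here is the same idea as a runnable sketch on a cut-down table. The markup is a hypothetical imitation of the reference page with only two rows, so it can be tried without a network request. A CSS selector variant is included as an alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row version of the functions table
html = """
<table class="w3-table-all notranslate">
  <tr><th>Function</th><th>Description</th></tr>
  <tr><td><a href="ref_func_abs.asp">abs()</a></td><td>Returns the absolute value of a number</td></tr>
  <tr><td><a href="ref_func_zip.asp">zip()</a></td><td>Returns an iterator, from two or more iterators</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# soup.td is shorthand for soup.find('td'): first match only
first = soup.td.get_text(strip=True)

# First cell of every row; the header row has no td, so it is skipped
names = [row.td.get_text(strip=True) for row in soup.find_all("tr") if row.td]

# Alternative: let the selector pick the first cell of each row
names_css = [td.get_text(strip=True) for td in soup.select("td:first-child")]

print(first, names, names_css)
```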

Parsing through HTML with BeautifulSoup in Python

Currently my code is as follows:
from bs4 import BeautifulSoup
import requests
main_url = 'http://www.foodnetwork.com/recipes/a-z'
response = requests.get(main_url)
soup = BeautifulSoup(response.text, "html.parser")
mylist = [t for tags in soup.find_all(class_='m-PromoList o-Capsule__m-PromoList')
          for t in tags if (t != '\n')]
As of now, I get a list containing the correct information, but it's still inside HTML tags. An example of an element of the list is given below:
<li class="m-PromoList__a-ListItem">"16 Bean" Pasta E Fagioli</li>
From this item I want to extract both the href link and the text separately, but I am having trouble doing this, and I really don't think getting this info should require a whole new set of operations. How can I do it?
You can do this to get href and text for one element:
href = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a')['href']
text = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a').text
For a list of items:
my_list = soup.find_all('li', attrs={'class':'m-PromoList__a-ListItem'})
for el in my_list:
    href = el.find('a')['href']
    text = el.find('a').text
    print(href)
    print(text)
Edit:
An important tip to reduce run time: don't search for the same tag more than once. Instead, save the tag in a variable and reuse it.
a = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a')
href = a.get('href')
text = a.text
In large HTML documents, finding a tag can take a lot of time, so storing the result means the search runs only once.
There are several ways to achieve the same result. Here is another approach using a CSS selector:
from bs4 import BeautifulSoup
import requests
response = requests.get('http://www.foodnetwork.com/recipes/a-z')
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select(".m-PromoList__a-ListItem a"):
    print("Item_Title: {}\nItem_Link: {}\n".format(item.text, item['href']))
Partial result:
Item_Title: "16 Bean" Pasta E Fagioli
Item_Link: //www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-3612570
Item_Title: "16 Bean" Pasta e Fagioli
Item_Link: //www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-1-3753755
Item_Title: "21" Apple Pie
Item_Link: //www.foodnetwork.com/recipes/21-apple-pie-recipe-1925900
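The links in the result above are protocol-relative (they start with //). A minimal sketch of one way to handle that, assuming the page URL from the question as the base and a small hypothetical snippet of markup: look up each a tag once, skip list items without a link, and resolve the href with urljoin from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.foodnetwork.com/recipes/a-z"

# Hypothetical markup: one item with a link and one without
html = """
<ul class="m-PromoList o-Capsule__m-PromoList">
  <li class="m-PromoList__a-ListItem"><a href="//www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-3612570">"16 Bean" Pasta E Fagioli</a></li>
  <li class="m-PromoList__a-ListItem">No link here</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

recipes = []
for li in soup.find_all("li", class_="m-PromoList__a-ListItem"):
    a = li.find("a")  # search once, then reuse the tag
    if a is None or not a.has_attr("href"):
        continue  # skip items that carry no link
    # urljoin fills in the scheme for protocol-relative links
    recipes.append((a.get_text(strip=True), urljoin(BASE_URL, a["href"])))

for title, link in recipes:
    print(title, link)
```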

Resources