Parsing through HTML with BeautifulSoup in Python

Currently my code is as follows:
from bs4 import BeautifulSoup
import requests
main_url = 'http://www.foodnetwork.com/recipes/a-z'
response = requests.get(main_url)
soup = BeautifulSoup(response.text, "html.parser")
mylist = [t for tags in soup.find_all(class_='m-PromoList o-Capsule__m-PromoList') for t in tags if (t != '\n')]
As of now, I get a list containing the correct information, but it's still inside HTML tags. An example element of the list is given below:
<li class="m-PromoList__a-ListItem"><a href="//www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-3612570">"16 Bean" Pasta E Fagioli</a></li>
From this item I want to extract both the href link and the following string separately, but I'm having trouble doing this, and I really don't think getting this info should require a whole new set of operations. How do I do it?

You can do this to get href and text for one element:
href = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a')['href']
text = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a').text
For a list of items:
my_list = soup.find_all('li', attrs={'class':'m-PromoList__a-ListItem'})
for el in my_list:
    href = el.find('a')['href']
    text = el.find('a').text
    print(href)
    print(text)
Edit:
An important tip to reduce run time: Don't search for the same tag more than once. Instead, save the tag in a variable and then use it multiple times.
a = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a')
href = a.get('href')
text = a.text
In a large HTML document, finding a tag takes a lot of time, so caching it this way reduces the time spent searching: the lookup runs only once.
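As a self-contained illustration of this pattern, the following runs on the single `<li>` from the question (with the href taken from the sample output further down) instead of the live page:

```python
from bs4 import BeautifulSoup

# Offline stand-in: the single <li> from the question, with the href
# taken from the sample output shown later on this page.
html = ('<li class="m-PromoList__a-ListItem">'
        '<a href="//www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-3612570">'
        '"16 Bean" Pasta E Fagioli</a></li>')
soup = BeautifulSoup(html, 'html.parser')

# Find the tag once, store it, then reuse it.
a = soup.find('li', attrs={'class': 'm-PromoList__a-ListItem'}).find('a')
href = a.get('href')
text = a.text
print(href)  # //www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-3612570
print(text)  # "16 Bean" Pasta E Fagioli
```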

There are several ways to achieve the same result. Here is another approach, using a CSS selector:
from bs4 import BeautifulSoup
import requests
response = requests.get('http://www.foodnetwork.com/recipes/a-z')
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select(".m-PromoList__a-ListItem a"):
    print("Item_Title: {}\nItem_Link: {}\n".format(item.text, item['href']))
Partial result:
Item_Title: "16 Bean" Pasta E Fagioli
Item_Link: //www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-3612570
Item_Title: "16 Bean" Pasta e Fagioli
Item_Link: //www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-1-3753755
Item_Title: "21" Apple Pie
Item_Link: //www.foodnetwork.com/recipes/21-apple-pie-recipe-1925900
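The links in this output are protocol-relative (they start with //). If full URLs are needed, the standard library's urljoin can resolve them against the page URL; a small offline sketch using two of the links above:

```python
from urllib.parse import urljoin

page_url = 'http://www.foodnetwork.com/recipes/a-z'
hrefs = [
    '//www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-3612570',
    '//www.foodnetwork.com/recipes/21-apple-pie-recipe-1925900',
]
# urljoin fills in the scheme ("http") for protocol-relative links.
absolute = [urljoin(page_url, href) for href in hrefs]
print(absolute)
```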

Related

How to scrape href using bs4 and Python

I am trying to scrape the "href" links from the page, but the result is "none". Can you please help me find where my code is going wrong? Why is the code returning "none"?
import requests
from bs4 import BeautifulSoup
import pprint
res = requests.get('https://news.ycombinator.com/newest')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.select('.titleline')
def fit_hn(links):
    hn = []
    for idx, item in enumerate(links):
        href = links[idx].get('href')
        hn.append(href)
    return hn

pprint.pprint(fit_hn(links))
Let's take a deeper look.
If you were to print links, you'll see that it's returning the span with the a inside:
<span class="titleline">The Strangely Beautiful Experience of Google Reviews<span class="sitebit comhead"> (<span class="sitestr">longreads.com</span>)</span></span>
So, really you need to go one level deeper and select the <a> tag.
Change your CSS selector to also select the a tags:
links = soup.select('.titleline a')
Your above code now prints:
['https://www.the-sun.com/tech/7078358/xoxe-ai-woman-detects-anxiety-and-crime-afterlife/',
'from?site=the-sun.com',
'https://www.economist.com/business/2023/01/05/how-to-avoid-flight-chaos',
'from?site=economist.com',
'https://www.thegutterreview.com/but-who-is-the-artist-the-kenny-who-trilogy-and-the-reality-of-ai-art/',
'from?site=thegutterreview.com',
'https://twitter.com/jburnmurdoch/status/1606223967903260673',
'from?site=twitter.com/jburnmurdoch',
'https://arstechnica.com/gadgets/2023/01/newest-raspberry-pi-camera-module-3-adds-autofocus-wide-view-hdr/',
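Note that '.titleline a' also matches the small "(site.com)" links, which show up above as relative 'from?site=...' entries. A quick post-processing step (sketched here on a few of the values above) keeps only the absolute article links:

```python
# A few of the values printed above, inlined so this runs offline.
links = [
    'https://www.the-sun.com/tech/7078358/xoxe-ai-woman-detects-anxiety-and-crime-afterlife/',
    'from?site=the-sun.com',
    'https://www.economist.com/business/2023/01/05/how-to-avoid-flight-chaos',
    'from?site=economist.com',
]
# Keep only absolute URLs; the 'from?site=...' entries are relative.
articles = [href for href in links if href.startswith('http')]
print(articles)
```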

Beautiful Soup Nested Loops

I was hoping to create a list of all of the firms featured on this list. I was hoping each winner would be their own section in the HTML, but it looks like there are multiple grouped together across several divs. How would you recommend going about solving this? I was able to pull all of the divs, but I don't know how to cycle through them appropriately. Thanks!
import requests
from bs4 import BeautifulSoup
import csv
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
element = soup.find()
person = soup.find_all('div', class_="under40")
This solution uses CSS selectors:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
# if you have an older version you'll need to use contains instead of -soup-contains
firm_tags = soup.select('h5:-soup-contains("Firm") strong')
# extract the text from the selected bs4.Tags
firms = [tag.text for tag in firm_tags]
# if there is extra whitespace
clean_firms = [f.strip() for f in firms]
It works by selecting all the strong tags whose parent h5 tag contains the word "Firm".
See the SoupSieve docs for more info on bs4's CSS selectors.
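A self-contained sketch of the same selector on made-up markup shaped like that page (the firm names here are invented, purely to show the selector at work):

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the page described above: h5 headings
# whose text includes "Firm", with the name inside a <strong> tag.
html = """
<h5>Firm: <strong> Acme Capital </strong></h5>
<h5>Title: <strong>Partner</strong></h5>
<h5>Firm: <strong>Example Growth Partners</strong></h5>
"""
soup = BeautifulSoup(html, 'html.parser')

# On older soupsieve versions, use 'h5:contains("Firm") strong' instead.
firm_tags = soup.select('h5:-soup-contains("Firm") strong')
clean_firms = [tag.text.strip() for tag in firm_tags]
print(clean_firms)  # ['Acme Capital', 'Example Growth Partners']
```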

Need help scraping a loose piece of text from webpage without div using bs4/python

I've been learning Python since last week and I need to scrape info about cities on a website. I managed to crawl the whole site, but I can't quite scrape the specific text info I need on each city's page (here's the URL of one of the city pages: http://www.mon-maire.fr/maire-de-abbecourt-02).
Here's the block I'm working from:
<div class="constructeur">
<b>Village: </b>Abbécourt <br/>
<b>Population :</b> 536 habitants <br/>
<b>Département :</b> Aisne <br/>
<b>Code postal :</b> 02300 <br/>
</div>
I'm trying to create a list like this with the loose text in it:
list = ['Abbécourt', '536 habitants', 'Aisne', '02300']
I came up with this code:
import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'http://www.mon-maire.fr/maire-de-abbecourt-02'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
sidebar = page_soup.findAll("div", {"class":"constructeur"})
for li in sidebar:
    b = li.findAll('br')
    print(b)
but it prints only [<br/>, <br/>, <br/>, <br/>]
When looking into the bs4 docs, I tried
b = li.findAll('br.next_element')
b = li.findAll('br.previous_element')
but it doesn't work. I'm still looking into the bs4 docs for a solution, but in the meantime, if someone would be kind enough to help me, it would be awesome.
b = [i.next_sibling.strip() for i in page_soup.select('div.constructeur > b')]
Rather than finding the parent element, just find the sub-elements you actually care about. That is done with bs4's CSS selectors (.select): the string 'div.constructeur > b' says "grab all <b> tags that are direct children of a div element with class constructeur", and it returns a list.
Iterate over that list of b tags with a list comprehension, grab the next_sibling of each b tag (which is the data you want), and strip the text since it has a bunch of whitespace.
The reason li.findAll('br.next_element') doesn't work is that next_element is an attribute of the tag objects contained in the list returned by .findAll; it isn't something you can put in a search string.
What you want is
b = li.findAll('br')
b = [i.previous_element.strip() for i in b]
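The select/next_sibling approach can be checked offline by feeding the HTML block from the question straight to BeautifulSoup:

```python
from bs4 import BeautifulSoup

# The exact <div class="constructeur"> block from the question.
html = """
<div class="constructeur">
<b>Village: </b>Abbécourt <br/>
<b>Population :</b> 536 habitants <br/>
<b>Département :</b> Aisne <br/>
<b>Code postal :</b> 02300 <br/>
</div>
"""
page_soup = BeautifulSoup(html, 'html.parser')

# Each <b> label is followed by a loose text node; next_sibling grabs it.
b = [i.next_sibling.strip() for i in page_soup.select('div.constructeur > b')]
print(b)  # ['Abbécourt', '536 habitants', 'Aisne', '02300']
```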

Scraping with Python 3

I'm new to scraping and to train I'm trying to get all the functions from this page:
https://www.w3schools.com/python/python_ref_functions.asp
from bs4 import BeautifulSoup
import requests
url = "https://www.w3schools.com/python/python_ref_functions.asp"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
print(soup.td.text)
# Output: abs()
no matter what I try, I only get the 1st one: abs()
Can you help me get them all from abs() to zip()?
To get all similar tags from a webpage, use find_all(); it returns a list of items. To get a single tag, use find(); it returns a single item.
The trick is to first get the parent tag of all the elements you need, and then drill down with whichever of those methods is most convenient. Here you can find more.
from bs4 import BeautifulSoup
import requests
url = "https://www.w3schools.com/python/python_ref_functions.asp"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
#scrape table which contains all functions
tabledata = soup.find("table", attrs={"class": "w3-table-all notranslate"})
#print(tabledata)
#from table data get all a tags of functions
functions = tabledata.find_all("a")
#find_all() method returns list of elements iterate over it
for func in functions:
    print(func.contents)
You can use find_all to iterate through all the tags that match the selector:
for tag in soup.find_all('td'):
    print(tag.text)
This will include the Description column though, so you'll need to change this to ignore those cells.
soup.td will only return the first matching tag.
So one solution would be:
for tag in soup.find_all('tr'):
    cell = tag.td
    if cell:
        print(cell.text)
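A runnable sketch of that row-based filtering on a miniature stand-in table (two mock rows instead of the real w3schools page):

```python
from bs4 import BeautifulSoup

# Miniature stand-in for the functions table: a header row plus two data rows.
html = """
<table>
<tr><th>Function</th><th>Description</th></tr>
<tr><td>abs()</td><td>Returns the absolute value</td></tr>
<tr><td>zip()</td><td>Returns an iterator of tuples</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# tag.td is the FIRST <td> in each row (the function column);
# the header row has no <td>, so cell is None and is skipped.
functions = []
for tag in soup.find_all('tr'):
    cell = tag.td
    if cell:
        functions.append(cell.text)
print(functions)  # ['abs()', 'zip()']
```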

Web Crawler keeps saying no attribute even though it really has

I have been developing a web-crawler for this website (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1), but I am having trouble crawling the title of each listing. I am pretty sure the attribute is there for carinfo_title = carinfo.find_all('a', class_='title').
Please check out the attached code and website code, and then give me any advice.
Thanks.
(Website Code)
https://drive.google.com/open?id=0BxKswko3bYpuRV9seTZZT3REak0
(My code)
from bs4 import BeautifulSoup
import urllib.request
target_url = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=1"
def fetch_post_list():
    URL = target_url
    res = urllib.request.urlopen(URL)
    html = res.read()
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', class_='cyber')
    # Car Info and Link
    carinfo = table.find_all('td', class_='carinfo')
    carinfo_title = carinfo.find_all('a', class_='title')
    print(carinfo_title)
    return carinfo_title

fetch_post_list()
You have multiple elements with the carinfo class and for every "carinfo" you need to get to the car title. Loop over the result of the table.find_all('td', class_='carinfo'):
for carinfo in table.find_all('td', class_='carinfo'):
    carinfo_title = carinfo.find('a', class_='title')
    print(carinfo_title.get_text())
Would print:
미니 쿠퍼 S JCW
지프 랭글러 3.8 애니버서리 70주년 에디션
...
벤츠 뉴 SLK200 블루이피션시
포르쉐 뉴 카이엔 4.8 GTS
마쯔다 MPV 2.3
Note that if you need only car titles, you can simplify it down to a single line:
print([elm.get_text() for elm in soup.select('table.cyber td.carinfo a.title')])
where the string inside the .select() method is a CSS selector.
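A minimal offline sketch of that selector, using mock rows with the same class names as the real listing page (the car titles here are placeholders):

```python
from bs4 import BeautifulSoup

# Mock markup reusing the class names from the question; titles are invented.
html = """
<table class="cyber">
<tr><td class="carinfo"><a class="title" href="#">Car one</a></td></tr>
<tr><td class="carinfo"><a class="title" href="#">Car two</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# 'table.cyber td.carinfo a.title' means: an <a class="title"> inside a
# <td class="carinfo"> inside a <table class="cyber">.
titles = [elm.get_text() for elm in soup.select('table.cyber td.carinfo a.title')]
print(titles)  # ['Car one', 'Car two']
```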
