get text from <span> with beautiful soup - python-3.x

I want to get the text from a span tag, but I'm running into problems.
I wrote this:
import bs4 as bs
import urllib.request
page = urllib.request.urlopen('http://www.accuweather.com/en/az/baku/27103/current-weather/27103').read()
soup = bs.BeautifulSoup(page, 'html.parser')
print(soup.find_all('li', class_='wind'))
and it returned this: [<li class="wind"><strong>28 km/h</strong></li>]
but I want to get just "28 km/h".
Then I tried this:
page = urllib.request.urlopen('http://www.accuweather.com/en/az/baku/27103/current-weather/27103').read()
soup = bs.BeautifulSoup(page, 'html.parser')
print(soup.find_all("span" , { "class" : "wind" }))
but it did not work either. Please help me with it.

You need to use .find() and not .find_all() to get a single element and call .get_text() to get the text of the desired element:
print(soup.find('li', class_='wind').get_text())
Or, you can also use .select_one() and locate the same element using a CSS selector:
print(soup.select_one('li.wind').get_text())
As a side note, look up the "AccuWeather API" - that might be a faster, easier and a more appropriate way to get to the desired data.
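The difference between the two calls can be seen without fetching the page. The following is a minimal sketch that assumes an inline HTML string shaped like the output shown in the question (the `humidity` item is made up for illustration):

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the fetched page (assumption: same shape as
# the <li class="wind"> element printed in the question).
html = '<ul><li class="wind"><strong>28 km/h</strong></li><li class="humidity">41%</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like ResultSet, so the text has to be taken
# from each element in it.
all_texts = [li.get_text(strip=True) for li in soup.find_all("li", class_="wind")]

# find() returns a single Tag, or None when nothing matches.
wind = soup.find("li", class_="wind")
wind_text = wind.get_text(strip=True) if wind is not None else None

print(all_texts)
print(wind_text)
```

Checking for None before calling .get_text() avoids an AttributeError if the page layout changes and the element is no longer there.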

Related

How to scrape href using bs4 and Python

I am trying to scrape the href links from the page, but the result is None. Can you please help me find where my code is going wrong? Why does the code return None?
import requests
from bs4 import BeautifulSoup
import pprint
res = requests.get('https://news.ycombinator.com/newest')
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.select('.titleline')
def fit_hn(links):
    hn = []
    for idx, item in enumerate(links):
        href = links[idx].get('href')
        hn.append(href)
    return hn
pprint.pprint(fit_hn(links))
Let's take a deeper look.
If you print links, you'll see that each result is the span that wraps the a tag, not the a tag itself:
<span class="titleline">The Strangely Beautiful Experience of Google Reviews<span class="sitebit comhead"> (<span class="sitestr">longreads.com</span>)</span></span>
So, really you need to go one level deeper and select the <a> tag.
Change your CSS selector to also select the a tags:
links = soup.select('.titleline a')
Your above code now prints:
['https://www.the-sun.com/tech/7078358/xoxe-ai-woman-detects-anxiety-and-crime-afterlife/',
'from?site=the-sun.com',
'https://www.economist.com/business/2023/01/05/how-to-avoid-flight-chaos',
'from?site=economist.com',
'https://www.thegutterreview.com/but-who-is-the-artist-the-kenny-who-trilogy-and-the-reality-of-ai-art/',
'from?site=thegutterreview.com',
'https://twitter.com/jburnmurdoch/status/1606223967903260673',
'from?site=twitter.com/jburnmurdoch',
'https://arstechnica.com/gadgets/2023/01/newest-raspberry-pi-camera-module-3-adds-autofocus-wide-view-hdr/',
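The 'from?site=...' entries in that output come from a second a tag nested inside the span. One possible way to keep only the story links is the child combinator. This is a sketch against assumed markup modelled on a single Hacker News title row; the URLs are placeholders, not real data:

```python
from bs4 import BeautifulSoup

# Assumed markup for one title row; the URLs are placeholders.
html = """
<span class="titleline">
  <a href="https://example.com/story">Example story</a>
  <span class="sitebit comhead"> (<a href="from?site=example.com"><span class="sitestr">example.com</span></a>)</span>
</span>
"""
soup = BeautifulSoup(html, "html.parser")

# The span itself has no href attribute, so Tag.get() returns None.
span_href = soup.select_one(".titleline").get("href")

# "> a" matches only anchors that are direct children of the span.
story_links = [a.get("href") for a in soup.select(".titleline > a")]

# The descendant selector also picks up the nested "from?site=" anchor.
all_links = [a.get("href") for a in soup.select(".titleline a")]

print(span_href)
print(story_links)
print(all_links)
```

This also shows why the original code printed None: .get('href') was called on the span, which has no such attribute.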

Beautiful Soup Nested Loops

I want to create a list of all of the firms featured on this list. I was hoping each winner would have their own section in the HTML, but it looks like several winners are grouped together across multiple divs. How would you recommend solving this? I was able to pull all of the divs, but I don't know how to loop through them properly. Thanks!
import requests
from bs4 import BeautifulSoup
import csv
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
element = soup.find()
person = soup.find_all('div', class_="under40")
This solution uses css selectors
import requests
from bs4 import BeautifulSoup
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
# older versions of soupsieve need :contains instead of :-soup-contains
firm_tags = soup.select('h5:-soup-contains("Firm") strong')
# extract the text from the selected bs4.Tags
firms = [tag.text for tag in firm_tags]
# if there is extra whitespace
clean_firms = [f.strip() for f in firms]
It works by selecting all the strong tags inside an h5 tag that contains the word "Firm".
See the SoupSieve docs for more info on bs4's CSS selectors.
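To see the selector at work without hitting the site, here is a small sketch on invented markup. The div and h5 layout and all the names are assumptions for illustration, and `:-soup-contains` needs a reasonably recent soupsieve:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two people per page, each with a Name and a Firm line.
html = """
<div class="under40">
  <h5>Name: <strong>Jane Doe</strong></h5>
  <h5>Firm: <strong> Acme Capital </strong></h5>
</div>
<div class="under40">
  <h5>Name: <strong>John Roe</strong></h5>
  <h5>Firm: <strong>Birch Partners</strong></h5>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only <strong> tags that sit inside an <h5> whose text contains "Firm".
firm_tags = soup.select('h5:-soup-contains("Firm") strong')

# get_text(strip=True) extracts the text and trims surrounding whitespace.
firms = [tag.get_text(strip=True) for tag in firm_tags]

print(firms)
```

Because the filter is applied to the h5 text, there is no need to loop over the outer divs at all.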

Web Scraping returns empty values

I'm new to Python and I'm trying to run a web scraping app.
When I run the Python script below, I get empty values. Please advise.
import bs4
import requests
url2= 'https://bitcoinfees.info/'
res2= requests.get(url)
soup2 = bs4.BeautifulSoup(res2.text,'html.parser')
highfee= soup2.select_one('html.wf-roboto-n5-active.wf-roboto-n4- active.wf-active body div.container ul.list-group li.list-group-item span.badge').text
print(highfee)
There are two errors in your example. First, requests.get(url) should be requests.get(url2). Second, the selector passed to select_one() is far more specific than it needs to be. It seems like you are just looking for the first span, in which case you can use soup2.select_one('span').text. All together:
url2= 'https://bitcoinfees.info/'
res2= requests.get(url2)
soup2 = bs4.BeautifulSoup(res2.text,'html.parser')
highfee= soup2.select_one('span').text
print(highfee)
If you are looking for a different span, you can use soup2.find(). In this case you are looking for the tag <span class="badge">, which you can find with:
soup2.find("span", class_="badge").string
See the Beautiful Soup docs on searching by CSS class.
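A shorter selector is also more robust: classes such as wf-active on the html element are typically added by a script in the browser, so they may be missing from the HTML that requests receives. The sketch below uses invented markup and a hypothetical helper, text_or_none, to show a short selector plus a guard for the no-match case:

```python
from bs4 import BeautifulSoup

# Invented markup loosely following the list-group structure in the question.
html = """
<ul class="list-group">
  <li class="list-group-item">Fastest fee <span class="badge">102</span></li>
  <li class="list-group-item">Hour fee <span class="badge">87</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_none(soup, selector):
    """Hypothetical helper: text of the first match, or None if nothing matches."""
    tag = soup.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


first_badge = text_or_none(soup, "li.list-group-item span.badge")
missing = text_or_none(soup, "span.no-such-class")

# select() returns every match, in document order.
all_badges = [s.get_text(strip=True) for s in soup.select("span.badge")]

print(first_badge, missing, all_badges)
```

Calling .text directly on the result of select_one() raises an AttributeError when there is no match, which is why the helper checks for None first.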

Parsing through HTML with BeautifulSoup in Python

Currently my code is as follows:
from bs4 import BeautifulSoup
import requests
main_url = 'http://www.foodnetwork.com/recipes/a-z'
response = requests.get(main_url)
soup = BeautifulSoup(response.text, "html.parser")
mylist = [t for tags in soup.find_all(class_='m-PromoList o-Capsule__m-PromoList') for t in tags if (t!='\n')]
So far, I get a list containing the correct information, but it's still wrapped in HTML tags. An example element of the list is given below:
<li class="m-PromoList__a-ListItem">"16 Bean" Pasta E Fagioli</li>
From this item I want to extract both the href link and the following string separately, but I am having trouble doing this, and I don't think getting this info should require a whole new set of operations. How can I do it?
You can do this to get href and text for one element:
href = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a')['href']
text = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a').text
For a list of items:
my_list = soup.find_all('li', attrs={'class':'m-PromoList__a-ListItem'})
for el in my_list:
    href = el.find('a')['href']
    text = el.find('a').text
    print(href)
    print(text)
Edit:
An important tip to reduce run time: Don't search for the same tag more than once. Instead, save the tag in a variable and then use it multiple times.
a = soup.find('li', attrs={'class':'m-PromoList__a-ListItem'}).find('a')
href = a.get('href')
text = a.text
In a large HTML document, finding a tag can take a lot of time, so saving the tag in a variable means the search runs only once.
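The same idea applies inside a loop. The sketch below looks up the a tag once per list item and skips items that have none; the markup and URLs are invented for illustration, with only the class name taken from the question:

```python
from bs4 import BeautifulSoup

# Invented list items; only the class name comes from the question.
html = """
<ul>
  <li class="m-PromoList__a-ListItem"><a href="//example.com/recipes/one">Recipe One</a></li>
  <li class="m-PromoList__a-ListItem">No link here</li>
  <li class="m-PromoList__a-ListItem"><a href="//example.com/recipes/two">Recipe Two</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

recipes = []
for li in soup.find_all("li", class_="m-PromoList__a-ListItem"):
    a = li.find("a")  # look the anchor up once per item and reuse it
    if a is None:
        continue      # skip items without a link instead of raising an error
    recipes.append((a.get_text(strip=True), a.get("href")))

print(recipes)
```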
There are several ways to achieve the same result. Here is another approach using a CSS selector:
from bs4 import BeautifulSoup
import requests
response = requests.get('http://www.foodnetwork.com/recipes/a-z')
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select(".m-PromoList__a-ListItem a"):
    print("Item_Title: {}\nItem_Link: {}\n".format(item.text,item['href']))
Partial result:
Item_Title: "16 Bean" Pasta E Fagioli
Item_Link: //www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-3612570
Item_Title: "16 Bean" Pasta e Fagioli
Item_Link: //www.foodnetwork.com/recipes/ina-garten/16-bean-pasta-e-fagioli-1-3753755
Item_Title: "21" Apple Pie
Item_Link: //www.foodnetwork.com/recipes/21-apple-pie-recipe-1925900
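The links in this output start with //, so they are protocol-relative rather than complete URLs. If full URLs are needed, one option is urljoin from the standard library. A small sketch follows; the second href is a hypothetical site-relative path added only to show that case:

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question).
base_url = "http://www.foodnetwork.com/recipes/a-z"

hrefs = [
    "//www.foodnetwork.com/recipes/21-apple-pie-recipe-1925900",  # from the output above
    "/recipes/a-z/123",  # hypothetical site-relative link
]

# urljoin fills in whatever the href is missing (scheme, host) from the base URL.
absolute = [urljoin(base_url, href) for href in hrefs]

for url in absolute:
    print(url)
```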

How to take content from html tag style with Python 3?

Let's say I have this:
<bgi align="br" bgalp="100" bgc="ccffff" hasrec="0" ialp="49" isvid="1" tile="0" useimg="1"/>
I simply want to take "CCFFFF" from bgc, but I don't know how to do it since this value varies. I was trying with re.compile, but I'm really new to this...
One option is to use BeautifulSoup and read the 'bgc' attribute of the 'bgi' tag:
from bs4 import BeautifulSoup
html_doc = """<bgi align="br" bgalp="100" bgc="ccffff" hasrec="0" ialp="49" isvid="1" tile="0" useimg="1"/>"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.bgi['bgc'])
output:
ccffff
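Since the attribute value varies and may not always be present, a slightly more defensive variant could use .get() with a default. This sketch also upper-cases the value to match the "CCFFFF" form from the question; the attribute name no-such-attr is made up to show the missing case:

```python
from bs4 import BeautifulSoup

html_doc = '<bgi align="br" bgalp="100" bgc="ccffff" hasrec="0" ialp="49" isvid="1" tile="0" useimg="1"/>'
soup = BeautifulSoup(html_doc, "html.parser")

tag = soup.find("bgi")

# .get() returns the default instead of raising KeyError when the attribute is absent.
color = tag.get("bgc", "").upper()
missing = tag.get("no-such-attr")

# .attrs is a plain dict of every attribute on the tag.
attr_names = sorted(tag.attrs)

print(color)
print(missing)
print(attr_names)
```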
