Web Scraping returns empty values - python-3.x

I'm new to Python and trying to run a web scraping app.
When I run the Python script below, I get empty values. Please advise.
import bs4
import requests
url2= 'https://bitcoinfees.info/'
res2= requests.get(url)
soup2 = bs4.BeautifulSoup(res2.text,'html.parser')
highfee= soup2.select_one('html.wf-roboto-n5-active.wf-roboto-n4-active.wf-active body div.container ul.list-group li.list-group-item span.badge').text
print(highfee)

There are two errors in your example: requests.get(url) should be requests.get(url2), and the selector for highfee has a lot of unnecessary stuff in it. It seems like you are just looking for the first span, in which case you can do soup2.select_one('span').text. All together:
import bs4
import requests

url2 = 'https://bitcoinfees.info/'
res2 = requests.get(url2)
soup2 = bs4.BeautifulSoup(res2.text, 'html.parser')
highfee = soup2.select_one('span').text
print(highfee)
If it is a different span you are looking for, you can use soup2.find(). In this case you are looking for the tag <span class="badge">, which you can search for with
soup2.find("span", class_="badge").string
See the soup docs for searching by CSS class.
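Equivalently, the same element can be matched with a CSS class selector; a minimal sketch (reusing the question's URL and the badge class from its markup):
import bs4
import requests

res2 = requests.get('https://bitcoinfees.info/')
soup2 = bs4.BeautifulSoup(res2.text, 'html.parser')

# 'span.badge' matches the first <span> whose class list includes "badge"
badge = soup2.select_one('span.badge')
if badge is not None:
    print(badge.text)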

Related

Beautiful Soup Nested Loops

I'm hoping to create a list of all of the firms featured on this list. I expected each winner to have their own section in the HTML, but it looks like several are grouped together across multiple divs. How would you recommend solving this? I was able to pull all of the divs, but I don't know how to cycle through them appropriately. Thanks!
import requests
from bs4 import BeautifulSoup
import csv
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
element = soup.find()
person = soup.find_all('div', class_="under40")
This solution uses CSS selectors:
import requests
from bs4 import BeautifulSoup
request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
text = request.text
soup = BeautifulSoup(text, 'html.parser')
# on older versions of beautifulsoup4/soupsieve you'll need to use :contains instead of :-soup-contains
firm_tags = soup.select('h5:-soup-contains("Firm") strong')
# extract the text from the selected bs4.Tags
firms = [tag.text for tag in firm_tags]
# if there is extra whitespace
clean_firms = [f.strip() for f in firms]
It works by selecting all the strong tags whose parent h5 tag contains the word "Firm".
See the SoupSieve docs for more info on bs4's CSS selectors.
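If you'd rather not depend on selector syntax (or on the installed soupsieve version), the same filtering can be done in plain Python; a rough equivalent sketch:
import requests
from bs4 import BeautifulSoup

request = requests.get("https://growthcapadvisory.com/growthcaps-top-40-under-40-growth-investors-of-2020/")
soup = BeautifulSoup(request.text, 'html.parser')

# find every <h5> whose text mentions "Firm", then take the <strong> inside it
clean_firms = []
for h5 in soup.find_all('h5'):
    if 'Firm' in h5.get_text():
        strong = h5.find('strong')
        if strong is not None:
            clean_firms.append(strong.get_text(strip=True))
print(clean_firms)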

how to read links from a list with beautifulsoup?

I have a list with lots of links that I want to scrape with BeautifulSoup in Python 3.
links is my list, and it contains hundreds of URLs. I have tried the code below to scrape them all, but it's not working for some reason:
links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html',...]
raw = urlopen(i in links).read()
ufos_doc = BeautifulSoup(raw, "html.parser")
raw should be a list containing the data of each web page. For each entry in raw, parse it and create a soup object. You can store each soup object in a list (I called it soups):
from urllib.request import urlopen
from bs4 import BeautifulSoup

links = ['http://www.nuforc.org/webreports/ndxe201904.html',
         'http://www.nuforc.org/webreports/ndxe201903.html',
         'http://www.nuforc.org/webreports/ndxe201902.html',
         'http://www.nuforc.org/webreports/ndxe201901.html',
         'http://www.nuforc.org/webreports/ndxe201812.html',
         'http://www.nuforc.org/webreports/ndxe201811.html']
raw = [urlopen(i).read() for i in links]
soups = []
for page in raw:
    soups.append(BeautifulSoup(page, 'html.parser'))
You can then access, e.g., the soup object for the first link with soups[0].
Also, for fetching the response of each URL, consider using the requests module instead of urllib. See this post.
You need a loop over the list links. If you have a lot of these to do, then, as mentioned in the other answer, consider requests. With requests you can create a Session object, which re-uses the underlying connection and therefore scrapes more efficiently:
import requests
from bs4 import BeautifulSoup as bs

links = ['http://www.nuforc.org/webreports/ndxe201904.html',
         'http://www.nuforc.org/webreports/ndxe201903.html',
         'http://www.nuforc.org/webreports/ndxe201902.html',
         'http://www.nuforc.org/webreports/ndxe201901.html',
         'http://www.nuforc.org/webreports/ndxe201812.html',
         'http://www.nuforc.org/webreports/ndxe201811.html']

with requests.Session() as s:
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # do something with soup
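As a sketch of what the "do something" step could look like (not from the original answer), you could keep each parsed page keyed by its URL for later processing:
import requests
from bs4 import BeautifulSoup as bs

links = ['http://www.nuforc.org/webreports/ndxe201904.html',
         'http://www.nuforc.org/webreports/ndxe201811.html']

soups_by_link = {}
with requests.Session() as s:
    for link in links:
        r = s.get(link)
        # html.parser avoids the extra lxml dependency
        soups_by_link[link] = bs(r.content, 'html.parser')

# e.g. the parsed page for the first link
print(soups_by_link[links[0]].title)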

Python - Issue Scraping with BeautifulSoup

I'm trying to scrape the Stack Overflow jobs page using Beautiful Soup 4 and urllib as a personal project. I'm trying to scrape all the links to the 50 jobs listed on each page, using a regex to identify them. Even though I reference the tag properly, I'm facing these two specific issues:
Instead of the 50 links clearly visible in the source code, I get only 25 results each time as my output (after accounting for and removing an initial irrelevant link).
There's a difference between how the links are ordered in the source code and in my output.
Here's my code. Any help on this will be greatly appreciated:
import bs4
import urllib.request
import re
#Obtaining source code to parse
sauce = urllib.request.urlopen('https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p&pg=0').read()
soup = bs4.BeautifulSoup(sauce, 'html.parser')
snippet = soup.find_all("script",type="application/ld+json")
strsnippet = str(snippet)
print(strsnippet)
joburls = re.findall('https://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strsnippet)
print("Urls: ",joburls)
print(len(joburls))
Disclaimer: I did some asking of my own for a part of this answer.
from bs4 import BeautifulSoup
import requests
import json
# note: link is slightly different; yours just redirects here
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]
print(len(urls))
50
Process:
Use soup.find rather than soup.find_all. This gives a single bs4.element.Tag whose text is the JSON payload.
json.loads(s.text) parses that into a nested dict. Access the value of the itemListElement key to get a list of dicts, then pull the url out of each one.
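For reference, the JSON-LD inside that script tag is roughly shaped like this (an illustrative sketch based on the schema.org ItemList format, not copied from the live page):
# illustrative shape only; the positions and URLs here are made up
data = {
    "@type": "ItemList",
    "itemListElement": [
        {"@type": "ListItem", "position": 1, "url": "https://stackoverflow.com/jobs/1/example-job"},
        {"@type": "ListItem", "position": 2, "url": "https://stackoverflow.com/jobs/2/another-job"},
    ],
}
urls = [el['url'] for el in data['itemListElement']]
print(urls)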

How to take content from html tag style with Python 3?

Let's say I have this:
<bgi align="br" bgalp="100" bgc="ccffff" hasrec="0" ialp="49" isvid="1" tile="0" useimg="1"/>
I simply want to take "ccffff" from bgc, but I don't know how to do it since this information varies. I was trying re.compile, but I'm really new to this...
One option is to use BeautifulSoup, and call the 'bgc' attribute of the 'bgi' tag:
from bs4 import BeautifulSoup
html_doc = """<bgi align="br" bgalp="100" bgc="ccffff" hasrec="0" ialp="49" isvid="1" tile="0" useimg="1"/>"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.bgi['bgc'])
output:
ccffff
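If the attribute might be missing on some tags, Tag.get() is a safer lookup than dict-style indexing, since it returns None instead of raising a KeyError; a small sketch:
from bs4 import BeautifulSoup

html_doc = """<bgi align="br" bgalp="100" bgc="ccffff" hasrec="0"/>"""
soup = BeautifulSoup(html_doc, 'html.parser')

# soup.bgi['bgc'] raises KeyError if the attribute is absent;
# Tag.get() returns None instead
print(soup.bgi.get('bgc'))    # ccffff
print(soup.bgi.get('nope'))   # None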

get text from <span> with beautiful soup

I want to get the text from a span tag, but I'm having problems.
I wrote this:
import bs4 as bs
import urllib.request
page = urllib.request.urlopen('http://www.accuweather.com/en/az/baku/27103/current-weather/27103').read()
soup = bs.BeautifulSoup(page, 'html.parser')
print(soup.find_all('li', class_='wind'))
and it returned [<li class="wind"><strong>28 km/h</strong></li>]
but I want to get just "28 km/h".
Then I tried
page = urllib.request.urlopen('http://www.accuweather.com/en/az/baku/27103/current-weather/27103').read()
soup = bs.BeautifulSoup(page, 'html.parser')
print(soup.find_all("span" , { "class" : "wind" }))
but it did not work either. Please help me with it.
You need to use .find() and not .find_all() to get a single element and call .get_text() to get the text of the desired element:
print(soup.find('li', class_='wind').get_text())
Or, you can also use .select_one() and locate the same element using a CSS selector:
print(soup.select_one('li.wind').get_text())
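And if you want the text of every matching element rather than just the first, a small sketch combining your original find_all() with get_text():
import bs4 as bs
import urllib.request

page = urllib.request.urlopen('http://www.accuweather.com/en/az/baku/27103/current-weather/27103').read()
soup = bs.BeautifulSoup(page, 'html.parser')

# get_text(strip=True) flattens nested tags like <strong> and trims whitespace
winds = [li.get_text(strip=True) for li in soup.find_all('li', class_='wind')]
print(winds)  # e.g. ['28 km/h']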
As a side note, look up the "AccuWeather API" - that might be a faster, easier and a more appropriate way to get to the desired data.
