Where do we put the "html.parser" argument when web scraping? - python-3.x

Look at the following snippet of code
import requests
from bs4 import BeautifulSoup
url = "..."  # insert URL here
# Method 1
html = requests.get(url, "html.parser")
soup = BeautifulSoup(html.text)
# Method 2
html2 = requests.get(url)
soup2 = BeautifulSoup(html2.text, "html.parser")
Which method is correct, Method 1 or Method 2? Should "html.parser" go in requests.get() or in BeautifulSoup()?

A parser is not part of the HTTP request.
It is what turns the downloaded document into a tree, and different parsers handle different types of document. So you name the parser when you parse the HTML with BeautifulSoup, not when you fetch it.
So Method 2 is correct.
Docstring of the BeautifulSoup constructor:
:param markup: A string or a file-like object representing
markup to be parsed.
:param features: Desirable features of the parser to be used. This
may be the name of a specific parser ("lxml", "lxml-xml",
"html.parser", or "html5lib") or it may be the type of markup
to be used ("html", "html5", "xml"). It's recommended that you
name a specific parser, so that Beautiful Soup gives you the
same results across platforms and virtual environments.
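
A quick way to check why Method 1 does not do what it appears to: the second positional parameter of requests.get() is params, so the string ends up in the query string instead of selecting a parser. The sketch below builds the request without sending it, then parses a small inline document. The URL and the sample markup are placeholders, not taken from the question.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; nothing is sent over the network in this sketch.
url = "https://example.com/page"

# requests.get(url, "html.parser") passes the string as `params`,
# so it becomes part of the query string rather than choosing a parser.
prepared = requests.Request("GET", url, params="html.parser").prepare()
print(prepared.url)

# The parser name belongs to the BeautifulSoup constructor (Method 2).
sample_html = "<html><body><p class='msg'>Hello</p></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")
print(soup.p.text)
```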

If I understand correctly, your Method 2 is correct: the parser belongs on the BeautifulSoup constructor, because
Requests is separate from Beautiful Soup, and I don't believe passing "html.parser" to requests.get() will do anything useful.
You want to specify the parser for Beautiful Soup because it can parse things other than HTML, e.g. XML with lxml's XML parser.
Beautiful Soup Docs

Related

Extract a particular link present in each of the considered web pages

I'm having trouble extracting a particular link from each of the web pages I'm considering.
In particular, considering for example the following websites:
https://lefooding.com/en/restaurants/ezkia
https://lefooding.com/en/restaurants/tekes
I would like to know if there is a unique way to extract the field WEBSITE (above the map) shown in the table on the left of the page.
For the reported cases, I would like to extract the links:
https://www.ezkia-restaurant.fr/
https://www.tekesrestaurant.com/
There are no unique tags to refer to and this makes extraction difficult.
I've thought of a solution using the selector, but it doesn't seem to work. For the first link I have:
from bs4 import BeautifulSoup
import requests
url = "https://lefooding.com/en/restaurants/ezkia"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
data = soup.find("div", {"class": "e-rowContent"})
print(data)
but there is no trace of the link I need here. Does anyone know of a possible solution?
Try this:
import requests
from bs4 import BeautifulSoup
urls = [
    "https://lefooding.com/en/restaurants/ezkia",
    "https://lefooding.com/en/restaurants/tekes",
]
with requests.Session() as s:
    for url in urls:
        soup = [
            link.strip() for link
            in BeautifulSoup(
                s.get(url).text, "lxml"
            ).select(".pageGuide__infos a")[-1]
        ]
        print(soup)
Output:
['https://www.ezkia-restaurant.fr']
['https://www.tekesrestaurant.com/']
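
The answer above prints the link's text. If the href itself is wanted, a minimal sketch of the same selector idea follows. The markup is an invented stand-in for the page's info panel (only the class name pageGuide__infos comes from the answer), so the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the restaurant page's info panel (assumed structure).
sample_html = """
<div class="pageGuide__infos">
  <a href="tel:+33100000000">Phone</a>
  <a href="https://www.example-restaurant.test/">example-restaurant.test</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select() returns every matching <a>; the answer above takes the last one.
links = soup.select(".pageGuide__infos a")
website = links[-1]["href"] if links else None
print(website)
```

Guarding against an empty result avoids an IndexError on pages where the panel has no links.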

how to read links from a list with beautifulsoup?

I have a list with lots of links and I want to scrape them with beautifulsoup in Python 3
links is my list and it contains hundreds of urls. I have tried this code to scrape them all, but it's not working for some reason
links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html',...]
raw = urlopen(i in links).read()
ufos_doc = BeautifulSoup(raw, "html.parser")
raw should be a list containing the data of each web-page. For each entry in raw, parse it and create a soup object. You can store each soup object in a list (I called it soups):
from urllib.request import urlopen
from bs4 import BeautifulSoup
links = ['http://www.nuforc.org/webreports/ndxe201904.html',
         'http://www.nuforc.org/webreports/ndxe201903.html',
         'http://www.nuforc.org/webreports/ndxe201902.html',
         'http://www.nuforc.org/webreports/ndxe201901.html',
         'http://www.nuforc.org/webreports/ndxe201812.html',
         'http://www.nuforc.org/webreports/ndxe201811.html']
# Download every page first, then parse each one
raw = [urlopen(i).read() for i in links]
soups = []
for page in raw:
    soups.append(BeautifulSoup(page, 'html.parser'))
You can then access, e.g., the soup object for the first link with soups[0].
Also, for fetching the response of each URL, consider using the requests module instead of urllib. See this post.
You need to loop over the list links. If you have a lot of these to fetch, as mentioned in the other answer, consider requests. With requests you can create a Session object, which re-uses the underlying connection and so makes scraping more efficient:
import requests
from bs4 import BeautifulSoup as bs
links = ['http://www.nuforc.org/webreports/ndxe201904.html',
         'http://www.nuforc.org/webreports/ndxe201903.html',
         'http://www.nuforc.org/webreports/ndxe201902.html',
         'http://www.nuforc.org/webreports/ndxe201901.html',
         'http://www.nuforc.org/webreports/ndxe201812.html',
         'http://www.nuforc.org/webreports/ndxe201811.html']
# One Session re-uses the connection across all requests
with requests.Session() as s:
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # do something
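
The loop can also be wrapped in a small helper that returns one soup per link. This is only a sketch: fetch_soups, FakeSession, FakeResponse, and the example.test URLs are hypothetical names introduced here so that the code runs offline. The fake session imitates just the part of requests.Session the helper relies on (a get() method returning an object with a .content attribute).

```python
from bs4 import BeautifulSoup


def fetch_soups(links, session):
    """Fetch each link with the given session and return one soup per page."""
    soups = []
    for link in links:
        response = session.get(link)
        soups.append(BeautifulSoup(response.content, "html.parser"))
    return soups


class FakeResponse:
    """Hypothetical stand-in for requests.Response (only .content is used)."""

    def __init__(self, content):
        self.content = content


class FakeSession:
    """Hypothetical stand-in for requests.Session so the sketch runs offline."""

    def get(self, url):
        # Return a tiny page whose <title> echoes the requested URL.
        return FakeResponse(f"<html><title>{url}</title></html>".encode())


demo_links = ["http://example.test/a.html", "http://example.test/b.html"]
soups = fetch_soups(demo_links, FakeSession())
titles = [soup.title.string for soup in soups]
print(titles)
```

With real pages, the same helper would be called as fetch_soups(links, s) inside a with requests.Session() as s: block.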

Scraping with Python 3

Python3:
I'm new to scraping, and to practice I'm trying to get all the functions from this page:
https://www.w3schools.com/python/python_ref_functions.asp
from bs4 import BeautifulSoup
import requests
url = "https://www.w3schools.com/python/python_ref_functions.asp"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
print(soup.td.text)
# Output: abs()
no matter what I try, I only get the 1st one: abs()
Can you help me get them all from abs() to zip()?
To get all matching tags from a web page, use find_all(); it returns a list of elements.
To get a single tag, use find(); it returns only the first match.
The trick is to find the parent tag of all the elements you need, then use whichever methods suit you. Here you can find more.
from bs4 import BeautifulSoup
import requests
url = "https://www.w3schools.com/python/python_ref_functions.asp"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
# Find the table that contains all the functions
tabledata = soup.find("table", attrs={"class": "w3-table-all notranslate"})
# print(tabledata)
# From the table, get all <a> tags (one per function)
functions = tabledata.find_all("a")
# find_all() returns a list of elements; iterate over it
for func in functions:
    print(func.contents)
You can use find_all to iterate over every tag that matches:
for tag in soup.find_all('td'):
    print(tag.text)
This will include the Description column too, so you'll need to change it to skip those cells.
soup.td only returns the first matching tag.
So one solution would be:
for tag in soup.find_all('tr'):
    cell = tag.td
    if cell:
        print(cell.text)
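
The difference between soup.td and find_all() can be checked on a small inline table. The table below is invented and only mimics the two-column shape (function, description) discussed above, so the real page may need a different selector.

```python
from bs4 import BeautifulSoup

# Small invented table in the same two-column shape (function, description).
sample_html = """
<table>
  <tr><th>Function</th><th>Description</th></tr>
  <tr><td><a href="abs.asp">abs()</a></td><td>Returns the absolute value</td></tr>
  <tr><td><a href="zip.asp">zip()</a></td><td>Returns an iterator of tuples</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# soup.td is shorthand for soup.find("td"): only the first match.
first_only = soup.td.get_text(strip=True)

# find_all("tr") visits every row; row.td is None for the header row,
# which contains <th> cells only, so it is skipped.
names = [row.td.get_text(strip=True) for row in soup.find_all("tr") if row.td]
print(first_only, names)
```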

Python - Issue Scraping with BeautifulSoup

I'm trying to scrape the Stack Overflow jobs page using Beautiful Soup 4 and urllib as a personal project. I want to scrape the links to all 50 jobs listed on each page, and I'm using a regex to identify them. Even though I reference the tag properly, I am facing these two specific issues:
Instead of the 50 links clearly visible in the source code, I get only 25 results each time as my output (after accounting for and removing an initial irrelevant link).
There's a difference between how the links are ordered in the source code and in my output.
Here's my code. Any help on this will be greatly appreciated:
import bs4
import urllib.request
import re
#Obtaining source code to parse
sauce = urllib.request.urlopen('https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p&pg=0').read()
soup = bs4.BeautifulSoup(sauce, 'html.parser')
snippet = soup.find_all("script",type="application/ld+json")
strsnippet = str(snippet)
print(strsnippet)
joburls = re.findall('https://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strsnippet)
print("Urls: ",joburls)
print(len(joburls))
Disclaimer: I did some asking of my own for a part of this answer.
from bs4 import BeautifulSoup
import requests
import json
# note: link is slightly different; yours just redirects here
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]
print(len(urls))
Output:
50
Process:
Use soup.find rather than soup.find_all. This returns a single bs4.element.Tag whose contents are JSON, instead of a list of tags.
json.loads(s.text) gives a nested dict. Its itemListElement key holds the job entries; take the url value of each entry to build the list.
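
The same steps can be tried offline on an inline document. The JSON-LD payload below is invented and only follows the shape the answer relies on (an itemListElement list whose entries carry a url key). The sketch reads the tag's contents through .string, which is a choice made here and not the .text used in the answer.

```python
import json
from bs4 import BeautifulSoup

# Invented JSON-LD payload in the shape the answer relies on.
sample_html = """
<html><head>
<script type="application/ld+json">
{"@type": "ItemList",
 "itemListElement": [
   {"@type": "ListItem", "position": 1, "url": "https://example.test/jobs/1"},
   {"@type": "ListItem", "position": 2, "url": "https://example.test/jobs/2"}
 ]}
</script>
</head><body></body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")
script = soup.find("script", type="application/ld+json")

# Decode the tag's contents as JSON instead of regex-matching its string form.
payload = json.loads(script.string)
urls = [item["url"] for item in payload["itemListElement"]]
print(urls)
```

Decoding the JSON keeps the entries in document order and avoids picking up unrelated URLs that a regex over the raw string could match.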

How to take content from html tag style with Python 3?

Let's say I have this:
<bgi align="br" bgalp="100" bgc="ccffff" hasrec="0" ialp="49" isvid="1" tile="0" useimg="1"/>
I simply want to take "CCFFFF" from bgc, but don't know how to do it since this information varies. Was trying with re.compile but I'm really new in this...
One option is to use BeautifulSoup and read the 'bgc' attribute of the 'bgi' tag:
from bs4 import BeautifulSoup
html_doc = """<bgi align="br" bgalp="100" bgc="ccffff" hasrec="0" ialp="49" isvid="1" tile="0" useimg="1"/>"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.bgi['bgc'])
output:
ccffff
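
Since the question mentions that this information varies, a slightly more defensive sketch may help: Tag.get() returns a default instead of raising KeyError when an attribute is absent. The shortened tag and the attribute name nosuchattr are made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = '<bgi align="br" bgalp="100" bgc="ccffff" hasrec="0"/>'
soup = BeautifulSoup(html_doc, "html.parser")
tag = soup.find("bgi")

# tag["bgc"] raises KeyError when the attribute is absent; .get() does not.
color = tag.get("bgc", "").upper()
missing = tag.get("nosuchattr")  # hypothetical attribute name, returns None
print(color, missing)
```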
