how to read links from a list with beautifulsoup? - python-3.x

I have a list with lots of links that I want to scrape with BeautifulSoup in Python 3.
links is my list and it contains hundreds of URLs. I have tried this code to scrape them all, but it's not working for some reason:
links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html',...]
raw = urlopen(i in links).read()
ufos_doc = BeautifulSoup(raw, "html.parser")

raw should be a list containing the data of each web page. For each entry in raw, parse it and create a soup object. You can store each soup object in a list (I called it soups):
from urllib.request import urlopen
from bs4 import BeautifulSoup

links = ['http://www.nuforc.org/webreports/ndxe201904.html',
         'http://www.nuforc.org/webreports/ndxe201903.html',
         'http://www.nuforc.org/webreports/ndxe201902.html',
         'http://www.nuforc.org/webreports/ndxe201901.html',
         'http://www.nuforc.org/webreports/ndxe201812.html',
         'http://www.nuforc.org/webreports/ndxe201811.html']

raw = [urlopen(i).read() for i in links]

soups = []
for page in raw:
    soups.append(BeautifulSoup(page, 'html.parser'))
You can then access, e.g., the soup object for the first link with soups[0].
Also, for fetching the response of each URL, consider using the requests module instead of urllib. See this post.
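For example, once soups is built you can query each page like any other soup object. The sketch below assumes each NUFORC index page contains a single HTML table of report links, which is an assumption about the markup, so adjust it to whatever you actually need:

# Continuing from the snippet above: soups[0] is the parsed first page.
print(soups[0].title)

# Assuming each NUFORC index page holds one HTML table of reports,
# count the rows on every page.
for link, soup in zip(links, soups):
    table = soup.find('table')
    row_count = len(table.find_all('tr')) if table else 0
    print(link, row_count)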

You need a loop over the list links. If you have a lot of these to do then, as mentioned in the other answer, consider requests. With requests you can create a Session object, which allows you to re-use the connection and thereby scrape more efficiently:
import requests
from bs4 import BeautifulSoup as bs
links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html']
with requests.Session() as s:
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # do something with soup here
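If you prefer to keep the parsed pages around rather than process them inside the loop, one option (a sketch, nothing specific to these pages) is to store each soup keyed by its URL:

with requests.Session() as s:
    parsed = {}
    for link in links:
        r = s.get(link)
        r.raise_for_status()  # fail loudly on a bad response
        parsed[link] = bs(r.content, 'lxml')

# parsed['http://www.nuforc.org/webreports/ndxe201904.html'] is now a soup object.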

Related

Extract a particular link present in each of the considered web pages

I'm having trouble extracting a particular link from each of the web pages I'm considering.
In particular, considering for example the following websites:
https://lefooding.com/en/restaurants/ezkia
https://lefooding.com/en/restaurants/tekes
I would like to know if there is a unique way to extract the field WEBSITE (above the map) shown in the table on the left of the page.
For the reported cases, I would like to extract the links:
https://www.ezkia-restaurant.fr/
https://www.tekesrestaurant.com/
There are no unique tags to refer to and this makes extraction difficult.
I've thought of a solution using the selector, but it doesn't seem to work. For the first link I have:
from bs4 import BeautifulSoup
import requests
url = "https://lefooding.com/en/restaurants/ezkia"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
data = soup.find("div", {"class": "e-rowContent"})
print(data)
but there is no trace of the link I need here. Does anyone know of a possible solution?
Try this:
import requests
from bs4 import BeautifulSoup

urls = [
    "https://lefooding.com/en/restaurants/ezkia",
    "https://lefooding.com/en/restaurants/tekes",
]

with requests.Session() as s:
    for url in urls:
        soup = [
            link.strip() for link
            in BeautifulSoup(
                s.get(url).text, "lxml"
            ).select(".pageGuide__infos a")[-1]
        ]
        print(soup)
Output:
['https://www.ezkia-restaurant.fr']
['https://www.tekesrestaurant.com/']
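Iterating over that last a tag inside .pageGuide__infos yields its text content, which on these pages appears to be the displayed website URL. If you would rather read the link target than the displayed text, a variant (assuming the anchor's href also points at the restaurant's site, which is not verified here) could be:

import requests
from bs4 import BeautifulSoup

urls = [
    "https://lefooding.com/en/restaurants/ezkia",
    "https://lefooding.com/en/restaurants/tekes",
]

with requests.Session() as s:
    for url in urls:
        anchor = BeautifulSoup(s.get(url).text, "lxml").select(".pageGuide__infos a")[-1]
        # .get("href") returns None instead of raising if the attribute is missing
        print(anchor.get("href"))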

Where do we put the "html.parser" argument when web scraping?

Look at the following snippet of code
import requests
from bs4 import BeautifulSoup
url = #Insert url here
# Method 1
html = requests.get(url, "html.parser")
soup = BeautifulSoup( html.text )
#Method 2
html2 = requests.get(url)
soup2 = BeautifulSoup(html2.text, "html.parser")
Which method is correct: Method 1 or Method 2? Should we put "html.parser" in requests.get() or in BeautifulSoup()?
Parsers are not part of the HTTP request; a parser is what Beautiful Soup uses to turn a document of a given type into a searchable tree. So it is when parsing the HTML document with BeautifulSoup that you have to name the parser.
Method 2 is therefore correct.
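A minimal sketch of the correct pattern, with a placeholder URL:

import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder URL

html = requests.get(url)                        # requests only fetches the page
soup = BeautifulSoup(html.text, "html.parser")  # the parser is named here
print(soup.title)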
Docstring of the BeautifulSoup constructor:
:param markup: A string or a file-like object representing
markup to be parsed.
:param features: Desirable features of the parser to be used. This
may be the name of a specific parser ("lxml", "lxml-xml",
"html.parser", or "html5lib") or it may be the type of markup
to be used ("html", "html5", "xml"). It's recommended that you
name a specific parser, so that Beautiful Soup gives you the
same results across platforms and virtual environments.
If I understand correctly, your Method 2 is correct, and you want to pass the parser to the BeautifulSoup constructor because:
Requests is separate from Beautiful Soup, and I don't believe passing "html.parser" to requests.get() will do anything.
You want to specify the parser for Beautiful Soup because it could be parsing things other than HTML, e.g. XML with lxml's XML parser.
Beautiful Soup Docs

Unable to use the site search function

I am trying to use the site's built-in search function, but I keep getting results from the main page. I'm not sure what I am doing wrong.
import requests
from bs4 import BeautifulSoup
body = {'input': 'ferris'}  # <-- have also tried 'query'
con = requests.post('http://www.collegedata.com/', data=body)
soup = BeautifulSoup(con.content, 'html.parser')
products = soup.findAll('div', {'class': 'schoolCityCol'})
print(soup)
print (products)
You have two issues in your code:
The POST URL is incorrect. You should correct it to:
con = requests.post('http://www.collegedata.com/cs/search/college/college_search_tmpl.jhtml', data=body)
Your POST data is incorrect too.
body = {'method':'submit', 'collegeName':'ferris', 'searchType':'1'}
You can use the developer tools in any browser (preferably Chrome) to check the POST URL and data on the Network tab.
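Putting both corrections together gives something like the sketch below; collegedata.com may have changed its endpoint or form fields since this was written, so double-check them on the Network tab first:

import requests
from bs4 import BeautifulSoup

body = {'method': 'submit', 'collegeName': 'ferris', 'searchType': '1'}
con = requests.post(
    'http://www.collegedata.com/cs/search/college/college_search_tmpl.jhtml',
    data=body,
)
soup = BeautifulSoup(con.content, 'html.parser')
products = soup.findAll('div', {'class': 'schoolCityCol'})
print(products)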

Python - Issue Scraping with BeautifulSoup

I'm trying to scrape the Stack Overflow jobs page using Beautiful Soup 4 and URLLIB as a personal project. I'm facing an issue where I'm trying to scrape all the links to the 50 jobs listed on each page. I'm using a regex to identify these links. Even though I reference the tag properly, I am facing these two specific issues:
Instead of the 50 links clearly visible in the source code, I get only 25 results each time as my output (after accounting for and removing an initial irrelevant link).
There's a difference between how the links are ordered in the source code and my output.
Here's my code. Any help on this will be greatly appreciated:
import bs4
import urllib.request
import re
#Obtaining source code to parse
sauce = urllib.request.urlopen('https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p&pg=0').read()
soup = bs4.BeautifulSoup(sauce, 'html.parser')
snippet = soup.find_all("script",type="application/ld+json")
strsnippet = str(snippet)
print(strsnippet)
joburls = re.findall('https://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strsnippet)
print("Urls: ",joburls)
print(len(joburls))
Disclaimer: I did some asking of my own for a part of this answer.
from bs4 import BeautifulSoup
import requests
import json
# note: link is slightly different; yours just redirects here
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]
print(len(urls))
50
Process:
Use soup.find rather than soup.find_all. This gives a single bs4.element.Tag whose text is JSON.
json.loads(s.text) yields a nested dict. Access the value of the itemListElement key to get a list of dicts (one per job), then collect each entry's url into a list.
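If you want to guard against the script tag being absent or the JSON shape changing, a slightly more defensive variant of the same idea (a sketch, not verified against the current page) could be:

from bs4 import BeautifulSoup
import requests
import json

link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')

s = soup.find('script', type='application/ld+json')
urls = []
if s is not None:
    data = json.loads(s.text)
    # itemListElement should be a list of dicts, each with a 'url' key
    urls = [el['url'] for el in data.get('itemListElement', []) if 'url' in el]
print(len(urls))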

Asynchronous scraping with Python: grequests and Beautifulsoup4

I am trying to scrape this site. I managed to do it by using urllib and BeautifulSoup, but urllib is too slow. I want to make asynchronous requests because there are thousands of URLs. I found that grequests is a nice package for this.
Example:
import grequests
from bs4 import BeautifulSoup

pages = []
page = "https://www.spitogatos.gr/search/results/residential/sale/r100/m100m101m102m103m104m105m106m107m108m109m110m150m151m152m153m154m155m156m157m158m159m160m161m162m163m164m165m166m167m168m169m170m171m172m173m174m175m176m177m178m179m180m181m182m183m184m185m186m187m188m189m190m191m192m193m194m195m196m197m198m106001m125000m"
for i in range(1, 1000):
    pages.append(page)
    page = "https://www.spitogatos.gr/search/results/residential/sale/r100/m100m101m102m103m104m105m106m107m108m109m110m150m151m152m153m154m155m156m157m158m159m160m161m162m163m164m165m166m167m168m169m170m171m172m173m174m175m176m177m178m179m180m181m182m183m184m185m186m187m188m189m190m191m192m193m194m195m196m197m198m106001m125000m"
    page = page + "/offset_{}".format(i * 10)

rs = (grequests.get(item) for item in pages)
a = grequests.map(rs)
The problem is that I don't know how to continue and use BeautifulSoup so as to get the HTML code of every page.
It would be nice to hear your ideas. Thank you!
Refer to the script below, and also check the source link. It will help.
reqs = (grequests.get(link) for link in links)
resp = grequests.imap(reqs, size=10)  # size caps the number of concurrent requests
for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')
    results = soup.find_all('a', attrs={"class": 'product__list-name'})
    print(results[0].text)
    prices = soup.find_all('span', attrs={'class': "pdpPriceMrp"})
    print(prices[0].text)
    discount = soup.find_all("div", attrs={"class": "listingDiscnt"})
    print(discount[0].text)
Source: https://blog.datahut.co/asynchronous-web-scraping-using-python/
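Applied to the pages list built in the question, the same pattern boils down to something like this; the selectors for spitogatos.gr listings are unknown here, so the loop only demonstrates getting a parsed soup per page:

import grequests
from bs4 import BeautifulSoup

# 'pages' is the list of spitogatos.gr result URLs built in the question above.
reqs = (grequests.get(url) for url in pages)
for r in grequests.imap(reqs, size=10):  # size caps the number of concurrent requests
    soup = BeautifulSoup(r.text, 'lxml')
    # soup now holds the parsed HTML of one results page; extract the
    # listings here with whatever selectors match the site's markup.
    print(r.url, len(r.text))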
