Extract a particular link present in each of the considered web pages - python-3.x

I'm having trouble extracting a particular link from each of the web pages I'm considering.
In particular, considering for example the following websites:
https://lefooding.com/en/restaurants/ezkia
https://lefooding.com/en/restaurants/tekes
I would like to know if there is a unique way to extract the field WEBSITE (above the map) shown in the table on the left of the page.
For the reported cases, I would like to extract the links:
https://www.ezkia-restaurant.fr/
https://www.tekesrestaurant.com/
There are no unique tags to refer to and this makes extraction difficult.
I've thought of a solution using the selector, but it doesn't seem to work. For the first link I have:
from bs4 import BeautifulSoup
import requests
url = "https://lefooding.com/en/restaurants/ezkia"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
data = soup.find("div", {"class": "e-rowContent"})
print(data)
but there is no trace of the link I need here. Does anyone know of a possible solution?

Try this:
import requests
from bs4 import BeautifulSoup
urls = [
"https://lefooding.com/en/restaurants/ezkia",
"https://lefooding.com/en/restaurants/tekes",
]
with requests.Session() as s:
for url in urls:
soup = [
link.strip() for link
in BeautifulSoup(
s.get(url).text, "lxml"
).select(".pageGuide__infos a")[-1]
]
print(soup)
Output:
['https://www.ezkia-restaurant.fr']
['https://www.tekesrestaurant.com/']

Related

how to read links from a list with beautifulsoup?

I have a list with lots of links and I want to scrape them with beautifulsoup in Python 3
links is my list and it contains hundreds of urls. I have tried this code to scrape them all, but it's not working for some reason
links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html',...]
raw = urlopen(i in links).read()
ufos_doc = BeautifulSoup(raw, "html.parser")
raw should be a list containing the data of each web-page. For each entry in raw, parse it and create a soup object. You can store each soup object in a list (I called it soups):
links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html']
raw = [urlopen(i).read() for i in links]
soups = []
for page in raw:
soups.append(BeautifulSoup(page,'html.parser'))
You can then access eg. the soup object for the first link with soups[0].
Also, for fetching the response of each URL, consider using the requests module instead of urllib. See this post.
You need a Loop over the list links. If you have a lot of these to do, as mentioned in other answer, consider requests. With requests you can create a Session object which will allow you to re-use connection thereby more efficiently scraping
import requests
from bs4 import BeautifulSoup as bs
links= ['http://www.nuforc.org/webreports/ndxe201904.html',
'http://www.nuforc.org/webreports/ndxe201903.html',
'http://www.nuforc.org/webreports/ndxe201902.html',
'http://www.nuforc.org/webreports/ndxe201901.html',
'http://www.nuforc.org/webreports/ndxe201812.html',
'http://www.nuforc.org/webreports/ndxe201811.html']
with requests.Session as s:
for link in links:
r = s.get(link)
soup = bs(r.content, 'lxml')
#do something

Scrape original links and headlines from Facebook posts

I need to gather some information which is not provided by Facebook Analytics. For example, the original url and headline of an article promoted on Facebook as a link post. This info is buried in the html code of a Facebook post but I struggle to dig it out. Will appreciate your help.
Let's take this example: https://www.facebook.com/bbcnews/posts/10156428513547217
I identified classes for a link (bbc.in...): "_6ks"
and headline: 'mbs _6m6 _2cnj _5s6c'
The code below doesn't return anything:
from bs4 import BeautifulSoup
import requests
link = 'https://www.facebook.com/bbcnews/posts/10156428513547217'
r = requests.get(link)
soup = BeautifulSoup(r.content, "lxml")
for paragraph in soup.find_all("div", class_="_6ks"):
for a in paragraph("a"):
print(a.get('href'))
for paragraph in soup.find_all("div", class_='mbs _6m6 _2cnj _5s6c'):
for a in paragraph("a"):
print(a.get('hover'))
Another way to achieve the same would be something like below:
from bs4 import BeautifulSoup
import requests
link = 'https://www.facebook.com/bbcnews/posts/10156428513547217'
res = requests.get(link,headers={'User-Agent':'Mozilla/5.0'})
comment = res.text.replace("-->", "").replace("<!--", "")
soup = BeautifulSoup(comment, "lxml")
items = soup.select_one('.mbs a')
print(items.get("href")+"\n",items.text)
The reason you are not able to getting any output is b'coz both of those divs are placed cleverly placed within comment tags <!-- --> . Comments are ignored by the parsers. If you print the soup, both of the divs are present but within the comment tags.
We can get the comments and then make a new soup using that to bypass this.
from bs4 import BeautifulSoup
from bs4 import Comment
import requests
link = 'https://www.facebook.com/bbcnews/posts/10156428513547217'
headers={'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0'}
r = requests.get(link,headers=headers)
soup = BeautifulSoup(r.content, "lxml")
comments=soup.find_all(string=lambda text:isinstance(text,Comment))
soup=BeautifulSoup(comments[0], "lxml")
for paragraph in soup.find_all("div", class_="_6ks"):
for a in paragraph("a"):
print(a.get('href'))
print('-------------------------------------------------------------------')
for paragraph in soup.find_all("div", class_='mbs _6m6 _2cnj _5s6c'):
for a in paragraph("a"):
print(a.text)
Output
https://l.facebook.com/l.php?u=https%3A%2F%2Fbbc.in%2F2FP4EgR&h=AT3jWrl9cgJEY-8NBLgbvOEtDSZ8dBABo4TJaVJ66QBbWdCsBypvAkN6MD7VhJoOgy_LGJeomQAlcwtex_Ab-7TvWXhKkLB1m_TjzxOSk3R2uP8qTUL3aTTj4Pcz2ZSZunWxZsPtOlJSpay_AtQfNTuLTUQ80OrtvRiDMs8duN3b27IH2UPnGThQ_YGJAcYJdPE3R9JbyxSQNhJ8yTmaRJe8pMNbgVkentXU4p3liys2IQvphwRd0V8ANmo-4xvKj1dRADHy3hOyUkcv_L2u8Z4WpLx1AZQCTitvfSLvhQRMZ0cK1vIjkuv3gfurRf250p3D54GxQZIsVLymDzNtLbOnigIuFRHfQFAUSBDzJGTqQB3hs4lilYyFXIqaC2cdXwDp8GDrmYbgRWmEMmN6A5fHDdRlF4m7MXJO0vJ_7uqkh0TAdcvTSc0dqt5Wv3wOoEN5S1b2ddLZOp3DFwApAGkSHsOtW7Pjc-STFljuV045ERsUWUbmnALSl9vxB6tiZ0poa3aGxZqnlFqsaTB-A8plwCWp5ed9JALlurBco447aELbpuRexqoOajxTvS_yW9BdSXaufzpbPFKaNt5go7uf4GjdekpITCApJo2JoAOzzsfKHdg1MXasOCw
-------------------------------------------------------------------
MPs put forward rival Brexit plans

web scraping with beautiful soup in python

I want to crawl the homepage of youtube to pull out all the links of videos. Following is the code
from bs4 import BeautifulSoup
import requests
s='https://www.youtube.com/'
html=requests.get(s)
html=html.text
s=BeautifulSoup(html,features="html.parser")
for e in s.find_all('a',{'id':'video-title'}):
link=e.get('href')
text=e.string
print(text)
print(link)
print()
Nothing is happenning when I run the above code. It seems like the id is not getting discovered. What am I doing wrong
It is because you are not getting the same HTML as your browser have.
import requests
from bs4 import BeautifulSoup
s = requests.get("https://youtube.com").text
soup = BeautifulSoup(s,'lxml')
print(soup)
Save this code's output to a file named test.html and run. You will see that it is not the same as the browser's, as it looks corrupted.
See these questions below.
HTML in browser doesn't correspond to scraped data in python
Python requests not giving me the same HTML as my browser is
Basically, I recommend you to use Selenium Webdriver as it reacts as a browser.
Yes, this is a strange scrape, but if you scrape at the 'div id="content"' level, you are able to get the data you are requesting. I was able to get the titles of each video, but it appears youtube has some rate limiting or throttling, so I do not think you will be able to get ALL of the titles and links. At any rate, below is what I got working for the titles:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'html.parser')
links = soup.find_all('div', id='content')
for each in links:
print(each.text)
May be this could help for scraping all videos from youtube home page,
from bs4 import BeautifulSoup
import requests
r = 'https://www.youtube.com/'
html = requests.get(r)
all_videos = []
soup = BeautifulSoup(html.text, 'html.parser')
for i in soup.find_all('a'):
if i.has_attr('href'):
text = i.attrs.get('href')
if text.startswith('/watch?'):
urls = r+text
all_videos.append(urls)
print('Total Videos', len(all_videos))
print('LIST OF VIDEOS', all_videos)
This code snippet will selects all links from youtube.com homepage that contains /watch? in their href attribute (links to videos):
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get('https://www.youtube.com/').text, 'lxml')
for a in soup.select('a[href*="/watch?"]'):
print('https://www.youtube.com{}'.format(a['href']))
Prints:
https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=2_mDOWLhkVU
https://www.youtube.com/watch?v=2_mDOWLhkVU
...and so on

BeautifulSoup returns urls of pages on same website shortened

My code for reference:
import httplib2
from bs4 import BeautifulSoup
h = httplib2.Http('.cache')
response, content = h.request('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html')
soup = BeautifulSoup(content, "lxml")
urls = []
for tag in soup.findAll('a', href=True):
urls.append(tag['href'])
responses = []
contents = []
for url in urls:
try:
response1, content1 = h.request(url)
responses.append(response1)
contents.append(content1)
except:
pass
The idea is, I get the payload of a webpage, and then scrape that for hyperlinks. One of the links is to yahoo.com, the other to 'http://csb.stanford.edu/class/public/index.html'
However the result I'm getting from BeautifulSoup is:
>>> urls
['http://www.yahoo.com/', '../../index.html']
This presents a problem, because the second part of the script cannot be executed on the second, shortened url. Is there any way to make BeautifulSoup retrieve the full url?
That's because the link on the webpage is actually of that form. The HTML from the page is:
<p>Or let's just link to <a href=../../index.html>another page on this server</a></p>
This is called a relative link.
To convert this to an absolute link, you can use urljoin from the standard library.
from urllib.parse import urljoin # Python3
urljoin('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html`,
'../../index.html')
# returns http://csb.stanford.edu/class/public/index.html

Asynchronous scraping with Python: grequests and Beautifulsoup4

I am trying to scrape this site . I managed to do it by using urllib and beautifulsoup. But urllib is too slow. I want to have asynchronous requests because the urls are thousands. I found that a nice package is grequests.
example:
import grequests
from bs4 import BeautifulSoup
pages = []
page="https://www.spitogatos.gr/search/results/residential/sale/r100/m100m101m102m103m104m105m106m107m108m109m110m150m151m152m153m154m155m156m157m158m159m160m161m162m163m164m165m166m167m168m169m170m171m172m173m174m175m176m177m178m179m180m181m182m183m184m185m186m187m188m189m190m191m192m193m194m195m196m197m198m106001m125000m"
for i in range(1,1000):
pages.append(page)
page="https://www.spitogatos.gr/search/results/residential/sale/r100/m100m101m102m103m104m105m106m107m108m109m110m150m151m152m153m154m155m156m157m158m159m160m161m162m163m164m165m166m167m168m169m170m171m172m173m174m175m176m177m178m179m180m181m182m183m184m185m186m187m188m189m190m191m192m193m194m195m196m197m198m106001m125000m"
page = page + "/offset_{}".format(i*10)
rs = (grequests.get(item) for item in pages)
a=grequests.map(rs)
The problem is that I don't know how to continue and use beautifulsoup. So as to get the html code of every page.
It would be nice to hear your ideas. Thank you!
Refer to the script below, also check the link of the source. It will help.
reqs = (grequests.get(link) for link in links)
resp=grequests.imap(reqs, grequests.Pool(10))
for r in resp:
soup = BeautifulSoup(r.text, 'lxml')
results = soup.find_all('a', attrs={"class":'product__list-name'})
print(results[0].text)
prices = soup.find_all('span', attrs={'class':"pdpPriceMrp"})
print(prices[0].text)
discount = soup.find_all("div", attrs={"class":"listingDiscnt"})
print(discount[0].text)
Source: https://blog.datahut.co/asynchronous-web-scraping-using-python/

Resources