Download Specific file from Website with BeautifulSoup - python-3.x

Following the BeautifulSoup documentation, I am trying to download a specific file from a webpage. First, I try to find the link that contains the file name:
import re
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.bancentral.gov.do/a/d/2538-mercado-cambiario")
parsed = BeautifulSoup(url.text, "html.parser")
link = parsed.find("a", text=re.compile("TASA_DOLAR_REFERENCIA_MC.xls"))
path = link.get('href')
print(f"{path}")
But with no success. Then, trying to print every link on that page, I get no links at all:
import re
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.bancentral.gov.do/a/d/2538-mercado-cambiario")
parsed = BeautifulSoup(url.text, "html.parser")
link = parsed.find_all('a')
for links in parsed.find_all("a href"):
    print(links.get('a href'))
It looks like the URL of the file is dynamic: it appends a ?v=123456789 parameter to the end of the URL (like a file version), which is why I need to download the file by its name.
(Eg https://cdn.bancentral.gov.do/documents/estadisticas/mercado-cambiario/documents/TASA_DOLAR_REFERENCIA_MC.xls?v=1612902983415)
Thanks.
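For what it's worth, if the only dynamic part is the ?v=... cache-busting token, a stable URL can often be derived by dropping the query string. A small sketch with Python's standard library (whether the CDN actually serves the file without the token is an assumption):

```python
from urllib.parse import urlparse, urlunparse

# The versioned URL from the question; the ?v=... part is a cache-busting token
versioned = ("https://cdn.bancentral.gov.do/documents/estadisticas/"
             "mercado-cambiario/documents/TASA_DOLAR_REFERENCIA_MC.xls?v=1612902983415")

# Drop the query string to get a version-independent URL
parts = urlparse(versioned)
stable = urlunparse(parts._replace(query=""))
print(stable)
```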

Actually, you are dealing with a dynamic JavaScript page that is fully loaded via an XHR request once the page loads.
Below is a direct call to the back-end API, which identifies the request using the page id (2538); from its response we can load your desired URL.
import requests
from bs4 import BeautifulSoup

def main(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:85.0) Gecko/20100101 Firefox/85.0'
    }
    with requests.Session() as req:
        req.headers.update(headers)
        data = {
            "id": "2538",
            "languageName": "es"
        }
        r = req.post(url, data=data)
        soup = BeautifulSoup(r.json()['result']['article']['content'], 'lxml')
        target = soup.select_one('a[href*=TASA_DOLAR_REFERENCIA_MC]')['href']
        r = req.get(target)
        with open('data.xls', 'wb') as f:
            f.write(r.content)

if __name__ == "__main__":
    main('https://www.bancentral.gov.do/Home/GetContentForRender')

Related

Not getting proper soup object from website while scraping

I am trying to scrape the Yahoo Finance website using BeautifulSoup and requests, but I am not getting the correct soup. It gives me the HTML of a 404 "page not found" page instead of the website's original HTML. Here is my code:
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get('https://finance.yahoo.com/quote/FBRX/profile?p=FBRX').text, 'lxml')
print(soup)
Here is my output:
Can you help me scrape this website?
Try setting the User-Agent HTTP header to obtain the correct response from the server:
import requests
from bs4 import BeautifulSoup
url = "https://finance.yahoo.com/quote/FBRX/profile?p=FBRX"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:92.0) Gecko/20100101 Firefox/92.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
print(soup.h1.text)
Prints:
Forte Biosciences, Inc. (FBRX)

How to get a link with web scraping

I would like to build a web scraper with a Python library (Beautiful Soup, for example) to collect the YouTube links on this page:
https://www.last.fm/tag/rock/tracks
Basically, I want to download the title of the song, the name of the artist, and the link to YouTube. Can anyone help me with some code?
Here's how you can do it:
from bs4 import BeautifulSoup
import requests

url = 'https://www.last.fm/tag/rock/tracks'
headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"
}

links = []
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# each chart entry holds a relative link to the track page
urls = soup.find_all(class_='chartlist-name')
for url in urls:
    relative_link = url.find('a')['href']
    link = 'https://www.last.fm' + relative_link
    links.append(link)
print(links)
With the function soup.find_all you find all the tags with the class "chartlist-name".
The for loop extracts the href from each tag and appends the full link to the links list.
In the future, please provide some code to show what you have attempted.
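As a side note, the find_all plus find combination can also be expressed as a single CSS selector. A minimal offline sketch (the sample markup below is invented to mimic last.fm's chartlist structure):

```python
from bs4 import BeautifulSoup

# invented sample markup mimicking last.fm's chartlist entries
html = """
<td class="chartlist-name"><a href="/music/Nirvana/_/Smells+Like+Teen+Spirit">Smells Like Teen Spirit</a></td>
<td class="chartlist-name"><a href="/music/Queen/_/Bohemian+Rhapsody">Bohemian Rhapsody</a></td>
"""
soup = BeautifulSoup(html, "html.parser")

# one selector replaces find_all(class_='chartlist-name') + .find('a')
links = ["https://www.last.fm" + a["href"] for a in soup.select(".chartlist-name a")]
print(links)
```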
I have expanded on Fabix's answer. The following code gets the YouTube link, song name, and artist for all 20 pages of the source website.
from bs4 import BeautifulSoup
import requests

master_url = 'https://www.last.fm/tag/rock/tracks?page={}'
headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"
}

for i in range(1, 21):  # pages 1 through 20
    response = requests.get(master_url.format(i), headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    chart_items = soup.find_all(class_='chartlist-row')
    for chart_item in chart_items:
        youtube_link = chart_item.find('a')['href']
        artist = chart_item.find('td', {'class': 'chartlist-artist'}).find('a').text
        song_name = chart_item.find('td', {'class': 'chartlist-name'}).find('a').text
        print('{}, {}, {}'.format(song_name, artist, youtube_link))
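If you want the results in a file instead of printed output, the tuples from such a loop can be written to CSV with the standard library. A sketch (the rows below are placeholders for the scraped values):

```python
import csv

# placeholder rows standing in for the (song_name, artist, youtube_link)
# values collected by the scraping loop
rows = [
    ("Song A", "Artist A", "https://www.youtube.com/watch?v=aaa"),
    ("Song B", "Artist B", "https://www.youtube.com/watch?v=bbb"),
]

with open("tracks.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["song", "artist", "youtube_link"])  # header row
    writer.writerows(rows)
```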

Web-scraping Yell.com in Python

After reading a LOT, I have tried to take my first step in web scraping, on the Yell website, with urllib and requests, but I get the same result in both cases (404 Not Found).
The url is:
url = https://www.yell.com/
What I have tried:
urllib package
import urllib.request
f = urllib.request.urlopen(url)
print(f.read(100))
and
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open(url)
requests package
url = 'www.yell.com'
response = requests.get(url)
and
headers = {'Accept': 'text/html'}
response = requests.get(url, headers=headers)
But I still get the 404 error.
Try this using urllib:
import urllib.request
url = 'https://www.yell.com/'
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)' }
request = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read())
I would suggest using requests + beautifulsoup4:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
It will make your scraping life easier.
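To make that concrete, here is a minimal requests + beautifulsoup4 pattern (a sketch, not tied to Yell's actual markup; the fetch_title and parse_title names are my own):

```python
import requests
from bs4 import BeautifulSoup

def parse_title(html):
    """Return the <title> text of a page, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

def fetch_title(url):
    """Fetch a page with a browser-like User-Agent and fail loudly on HTTP errors."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64)"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # surfaces the 404 instead of parsing an error page
    return parse_title(response.text)

print(parse_title("<html><head><title>Yell.com</title></head></html>"))
```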
# You can also use Selenium to avoid HTTP errors
from selenium import webdriver
from bs4 import BeautifulSoup
import urllib.request

main_url = 'https://www.yell.com/'

# render the page in a real browser, then parse the resulting HTML
driver = webdriver.Chrome(r'write chromedriver path')
driver.get(main_url)
res = driver.execute_script("return document.documentElement.outerHTML")
soup = BeautifulSoup(res, 'html.parser')

# plain urllib with a User-Agent header also works for this site
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
request = urllib.request.Request(main_url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read())

I want to open the first link that appears when I do a search on Google

I want to get the first link from the HTML parser, but I'm not getting anything (I tried to print it).
Also, when I inspect the page in the browser, the links are under class='r',
but when I print soup.prettify() and analyse it closely, I find there is no class='r'; instead there is class="BNeawe UPmit AP7Wnd".
Please help, thanks in advance!
import requests
import sys
import bs4
import webbrowser

def open_web(query):
    res = requests.get('https://google.com/search?q=' + query)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    link_elements = soup.select('.r a')
    link_to_open = min(1, len(link_elements))
    for i in range(link_to_open):
        webbrowser.open('https://google.com' + link_elements[i].get('href'))

open_web('youtube')
The problem is that Google serves different HTML when you don't specify a User-Agent in the headers. To add a User-Agent to your request, pass it via the headers= parameter:
import requests
import bs4

def open_web(query):
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}
    res = requests.get('https://google.com/search?q=' + query, headers=headers)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    link_elements = soup.select('.r a')
    print(link_elements)

open_web('youtube')
Prints:
[<a href="https://www.youtube.com/?gl=EE&hl=et" onmousedown="return rwt(this,'','','','1','AOvVaw2lWnw7oOhIzXdoFGYhvwv_','','2ahUKEwjove3h7onkAhXmkYsKHbWPAUYQFjAAegQIBhAC','','',event)"><h3 class="LC20lb">
... and so on.
You received completely different HTML with different elements and selectors, thus the output is empty. Google blocks your request because the default requests user-agent is python-requests; Google recognizes it and blocks it. Check what your user-agent is.
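You can confirm this default locally with requests' own helper (a small sketch; the exact version number will vary):

```python
import requests

# requests' default headers include a User-Agent of "python-requests/<version>",
# which Google detects and blocks
default_ua = requests.utils.default_headers()["User-Agent"]
print(default_ua)
```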
The User-Agent header identifies the browser, its version number, and the host operating system; it represents a person (a browser) in a Web context and lets servers and network peers tell whether the client is a bot.
Sometimes you can receive different HTML, with different selectors.
You can pass URL params as a dict(), which is more readable, and requests does everything for you automatically (the same goes for adding a user-agent to the headers):
params = {
    "q": "My query goes here"
}
requests.get("YOUR_URL", params=params)
If you want to get the very first link then use select_one() instead.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "My query goes here"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

link = soup.select_one('.yuRUbf a')['href']
print(link)
# https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data you want from a JSON string, rather than figuring out how to extract it, maintain the parser, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()

# [0] means the first index of the search results
link = results['organic_results'][0]['link']
# https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
Disclaimer, I work for SerpApi.

soup.select('.r a') in 'https://www.google.com/#q=vigilante+mic' gives empty list in python BeautifulSoup

I am using BeautifulSoup to extract all links from a Google search results page.
Here's the snippet of code:
import requests,bs4
res = requests.get('https://www.google.com/#q=vigilante+mic')
soup = bs4.BeautifulSoup(res.text)
linkElem = soup.select('.r a')
But soup.select('.r a') is returning an empty list
Thanks
That's because of the URL you are using:
https://www.google.com/#q=vigilante+mic
is a JavaScript version of the search. If you curl it, you will see there are no results in the HTML. This happens because the results are fetched through JavaScript, and requests doesn't handle that.
Try this other URL instead (one that is not JavaScript-based):
https://www.google.com/search?q=vigilante+mic
Now it works:
import requests, bs4
res = requests.get('https://www.google.com/search?q=vigilante+mic')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElem = soup.select('.r a')
Besides changing #q= to ?q=, another reason the list is empty is that no user-agent is specified, so Google blocks the request. What is my user-agent?
Code and example in the online IDE that scrapes more:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'cyber security'}

html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
----------
'''
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://en.wikipedia.org/wiki/Computer_security
https://www.cisa.gov/cybersecurity
https://onlinedegrees.und.edu/blog/types-of-cyber-security-threats/
https://digitalguardian.com/blog/what-cyber-security
https://staysafeonline.org/
'''
Alternatively, you can achieve this by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you only need to iterate over JSON string without figuring out how to extract something or find CSS that works.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "cyber security",
    "hl": "en",
}
search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)
-----------------
'''
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://en.wikipedia.org/wiki/Computer_security
https://www.cisa.gov/cybersecurity
https://onlinedegrees.und.edu/blog/types-of-cyber-security-threats/
https://digitalguardian.com/blog/what-cyber-security
https://staysafeonline.org/
'''
P.S. I wrote a somewhat more detailed blog post about how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.
