How to get a link with web scraping

I would like to write a web scraper with a Python library (Beautiful Soup, for example) to collect the YouTube links on this page:
https://www.last.fm/tag/rock/tracks
Basically, I want to download the song title, the artist name, and the YouTube link for each track. Can anyone help me with some code?

Here's how you can do it:
from bs4 import BeautifulSoup
import requests
url = 'https://www.last.fm/tag/rock/tracks'
headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"
}
links = []
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
# Every element holding a track title carries the class "chartlist-name"
name_cells = soup.find_all(class_='chartlist-name')
for cell in name_cells:
    relative_link = cell.find('a')['href']
    # lstrip avoids a doubled slash if the href already starts with "/"
    link = 'https://www.last.fm/' + relative_link.lstrip('/')
    links.append(link)
print(links)
The function soup.find_all returns every tag with the class "chartlist-name".
The for loop pulls the href out of the first <a> element in each tag and appends the full link to the "links" list.
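Joining the host and the href with + needs care, because the result can end up with a doubled or a missing slash depending on how the href is written. Here is a minimal sketch of another way to do it with urljoin from the standard library. The helper name to_absolute and the sample hrefs are made up for illustration and are not taken from the page.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.last.fm/"

def to_absolute(href, base=BASE_URL):
    """Resolve an href against the site root; absolute hrefs pass through unchanged."""
    return urljoin(base, href)

# Hypothetical hrefs of the kind a chart page might contain
sample_hrefs = [
    "/music/Some+Artist/_/Some+Track",
    "https://example.com/watch?v=abc",
]
absolute_links = [to_absolute(href) for href in sample_hrefs]
print(absolute_links)
```

In the loop above, to_absolute(relative_link) could stand in for the string concatenation.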

In the future, please include some code to show what you have attempted.
I have expanded on Fabix's answer. The following code gets the YouTube link, song name, and artist for all 20 pages on the source website.
from bs4 import BeautifulSoup
import requests
master_url = 'https://www.last.fm/tag/rock/tracks?page={}'
headers = {
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"
}
# range() excludes its end value, so 21 is needed to reach page 20
for i in range(1, 21):
    response = requests.get(master_url.format(i), headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    chart_items = soup.find_all(class_='chartlist-row')
    for chart_item in chart_items:
        # The first <a> in each row is taken to be the YouTube link
        youtube_link = chart_item.find('a')['href']
        artist = chart_item.find('td', {'class': 'chartlist-artist'}).find('a').text
        song_name = chart_item.find('td', {'class': 'chartlist-name'}).find('a').text
        print('{}, {}, {}'.format(song_name, artist, youtube_link))
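Python's range() excludes its end value, so a loop meant to cover pages 1 to 20 has to run over range(1, 21). Here is a small sketch that builds the page URLs up front so the count is easy to check. The helper name page_urls is hypothetical.

```python
MASTER_URL = "https://www.last.fm/tag/rock/tracks?page={}"

def page_urls(template, last_page):
    """Return the URLs for pages 1..last_page, inclusive."""
    # range() stops before its end value, hence last_page + 1
    return [template.format(page) for page in range(1, last_page + 1)]

urls = page_urls(MASTER_URL, 20)
print(len(urls))
print(urls[0])
print(urls[-1])
```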

Related

WebScraping / Identical sites not working?

I would like to scrape the header element from both of these links.
To me, the two pages look absolutely identical (see the pictures below).
Why does scraping work only for the second link and not for the first?
import time
import requests
from bs4 import BeautifulSoup
# not working
link = "https://apps.apple.com/us/app/bingo-story-live-bingo-games/id1179108009?uo=4"
page = requests.get(link)
time.sleep(1)
soup = BeautifulSoup(page.content, "html.parser")
erg = soup.find("header")
print(f"First Link: {erg}")
# working
link = "https://apps.apple.com/us/app/jackpot-boom-casino-slots/id1554995201?uo=4"
page = requests.get(link)
time.sleep(1)
soup = BeautifulSoup(page.content, "html.parser")
erg = soup.find("header")
print(f"Second Link: {len(erg)}")
Working:
Not Working:

The page is sometimes loaded by JavaScript, which requests does not execute, so the header can be missing from the response.
You can use a while loop that requests the page again until the header appears in the soup, and then break:
import requests
from bs4 import BeautifulSoup
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"
}
link = "https://apps.apple.com/us/app/bingo-story-live-bingo-games/id1179108009?uo=4"
# Request the page again until the <header> element shows up in the HTML
while True:
    soup = BeautifulSoup(requests.get(link, headers=headers).content, "html.parser")
    header = soup.find("header")
    if header:
        break
print(header)
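A while True loop never ends if the header never appears, so it may be worth capping the number of attempts. Here is a rough sketch of that idea. The helper fetch_until and its parameters are hypothetical, and the "network call" is a stand-in iterator so the example runs offline.

```python
import time

def fetch_until(fetch, is_ready, max_attempts=5, delay=0.0):
    """Call fetch() until is_ready(result) is true or the attempts run out.

    Returns the first ready result, or None if every attempt failed.
    """
    for attempt in range(max_attempts):
        result = fetch()
        if is_ready(result):
            return result
        if attempt < max_attempts - 1:
            time.sleep(delay)  # pause between attempts, not after the last one
    return None

# Stand-in for the real request: the "header" only shows up on the third try
responses = iter([None, None, "<header>App title</header>"])
header = fetch_until(lambda: next(responses), lambda r: r is not None)
print(header)

# A source that never yields a header returns None instead of looping forever
missing = fetch_until(lambda: None, lambda r: r is not None, max_attempts=3)
print(missing)
```

With the real page, fetch would wrap the requests.get and soup.find("header") calls, and a non-zero delay would be kinder to the server.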

Try this to get whichever fields you want from those links. Currently it fetches the title. You can modify res.json()['data'][0]['attributes']['name'] to grab any field of interest. Make sure to put the URLs in the urls_to_scrape list.
import json
import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote
urls_to_scrape = [
    'https://apps.apple.com/us/app/bingo-story-live-bingo-games/id1179108009?uo=4',
    'https://apps.apple.com/us/app/jackpot-boom-casino-slots/id1554995201?uo=4'
]
# Any app page works here; it is only fetched to read the API token
base_url = 'https://apps.apple.com/us/app/bingo-story-live-bingo-games/id1179108009?uo=4'
link = 'https://amp-api.apps.apple.com/v1/catalog/US/apps/{}'
params = {
    'platform': 'web',
    'additionalPlatforms': 'appletv,ipad,iphone,mac',
    'extend': 'customPromotionalText,customScreenshotsByType,description,developerInfo,distributionKind,editorialVideo,fileSizeByDevice,messagesScreenshots,privacy,privacyPolicyText,privacyPolicyUrl,requirementsByDeviceFamily,supportURLForLanguage,versionHistory,websiteUrl',
    'include': 'genres,developer,reviews,merchandised-in-apps,customers-also-bought-apps,developer-other-apps,app-bundles,top-in-apps,related-editorial-items',
    'l': 'en-us',
    'limit[merchandised-in-apps]': '20',
    'omit[resource]': 'autos',
    'sparseLimit[apps:related-editorial-items]': '5'
}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'
    res = s.get(base_url)
    soup = BeautifulSoup(res.text, "lxml")
    # The bearer token sits URL-encoded inside a <meta> tag's content attribute
    token_raw = soup.select_one("[name='web-experience-app/config/environment']").get("content")
    token = json.loads(unquote(token_raw))['MEDIA_API']['token']
    s.headers['Accept'] = 'application/json'
    s.headers['Referer'] = 'https://apps.apple.com/'
    s.headers['Authorization'] = f'Bearer {token}'
    for url in urls_to_scrape:
        # Last path segment is "id<digits>?uo=4"; keep only the digits
        id_ = url.split("/")[-1].strip("id").split("?")[0]
        res = s.get(link.format(id_), params=params)
        title = res.json()['data'][0]['attributes']['name']
        print(title)
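One detail about the id extraction: str.strip("id") removes any leading or trailing "i" and "d" characters rather than the prefix "id". It works here only because the id itself is all digits. Here is a sketch of a stricter version using a regular expression. The helper name extract_app_id is hypothetical.

```python
import re
from urllib.parse import urlsplit

def extract_app_id(url):
    """Return the numeric id from a URL whose path ends in 'id<digits>'."""
    path = urlsplit(url).path  # drops the '?uo=4' query string
    match = re.search(r"/id(\d+)$", path)
    if match is None:
        raise ValueError(f"no app id found in {url!r}")
    return match.group(1)

app_id = extract_app_id(
    "https://apps.apple.com/us/app/bingo-story-live-bingo-games/id1179108009?uo=4"
)
print(app_id)
```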

I want to open the first link that appears when I do a search on Google

I want to get the first link from the HTML parser, but I'm not getting anything (I tried printing the result).
Also, when I inspect the page in the browser, the links are under class='r'.
But when I print soup.prettify() and look closely, there is no class='r'; instead there is class="BNeawe UPmit AP7Wnd".
Please help. Thanks in advance!
import requests
import sys
import bs4
import webbrowser
def open_web(query):
    res = requests.get('https://google.com/search?q=' + query)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    link_elements = soup.select('.r a')
    link_to_open = min(1, len(link_elements))
    for i in range(link_to_open):
        webbrowser.open('https://google.com' + link_elements[i].get('href'))
open_web('youtube')

The problem is that Google serves different HTML when you don't specify a User-Agent in the headers. To add a User-Agent to your request, pass it in the headers= argument:
import requests
import bs4
def open_web(query):
    headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}
    res = requests.get('https://google.com/search?q=' + query, headers=headers)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    link_elements = soup.select('.r a')
    print(link_elements)
open_web('youtube')
Prints:
[<a href="https://www.youtube.com/?gl=EE&hl=et" onmousedown="return rwt(this,'','','','1','AOvVaw2lWnw7oOhIzXdoFGYhvwv_','','2ahUKEwjove3h7onkAhXmkYsKHbWPAUYQFjAAegQIBhAC','','',event)"><h3 class="LC20lb">
... and so on.

You received completely different HTML, with different elements and selectors, so the output is empty. Google blocks your request because the default requests user-agent is python-requests, which Google recognizes as a script. Check what your user-agent is.
The user-agent identifies the browser, its version number, and its host operating system. It represents the client (normally a person's browser) in a web context and lets servers and network peers judge whether the request comes from a bot.
You can pass URL parameters as a dict, which is more readable; requests then builds the query string for you automatically (the same goes for adding a user-agent to the headers):
params = {
    "q": "My query goes here"
}
requests.get("YOUR_URL", params=params)
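To see what the dict saves you from writing by hand, here is a sketch that builds the same kind of query string with urlencode from the standard library. This is only a stand-in for the encoding that requests applies to params, not a look inside requests itself.

```python
from urllib.parse import urlencode

params = {"q": "My query goes here", "hl": "en"}

# urlencode escapes spaces and reserved characters and joins the pairs with '&'
query_string = urlencode(params)
full_url = "https://www.google.com/search?" + query_string
print(full_url)
```

Building the string by concatenation, as in 'https://google.com/search?q=' + query, skips this escaping and breaks as soon as the query contains a space or an ampersand.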
If you want only the very first link, use select_one() instead of select().
Full example:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "My query goes here"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
link = soup.select_one('.yuRUbf a')['href']
print(link)
# https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
Alternatively, you can do the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to pick the data you want out of the returned JSON, rather than working out how to extract it, maintain the parser, or bypass blocks from Google.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# [0] means first index of search results
link = results['organic_results'][0]['link']
# https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
Disclaimer, I work for SerpApi.
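Indexing results['organic_results'][0] raises a KeyError or IndexError when a search comes back without organic results. Here is a sketch of a more defensive lookup on a plain dictionary. The helper first_organic_link and the sample data are hypothetical; only the keys organic_results and link come from the code above.

```python
def first_organic_link(results):
    """Return the first organic result's link, or None if there is none."""
    organic = results.get("organic_results") or []
    if not organic:
        return None
    return organic[0].get("link")

# Hypothetical response shaped like the dictionary used above
sample = {
    "organic_results": [
        {"link": "https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html"},
        {"link": "https://example.com/"},
    ]
}
print(first_organic_link(sample))
print(first_organic_link({}))
```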

Scraping store locations from a complex website

I am new to web scraping and I need to scrape store locations from the given website. The information I need includes the location title, address, city, state, country, and phone number. So far I have fetched the webpage, but I don't know how to go forward.
import requests
from bs4 import BeautifulSoup
url = 'https://www.rebounderz.com/all-locations/'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 '
                         'Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
Please guide me on how to get the required information. I have searched other answers and looked into tutorials too, but the structure of this website has confused me.

import urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
url = "https://www.rebounderz.com/all-locations/"
# Note: an unverified context skips TLS certificate checks
context = ssl._create_unverified_context()
headers = {}
headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'
request = urllib.request.Request(url, headers=headers)
html = urlopen(request, context=context)
soup = BeautifulSoup(html, 'lxml')
# Each location sits in a <div> with the class "size1of3"
divs = soup.find_all('div', {"class": "size1of3"})
for div in divs:
    print(div.find("h5").get_text())  # text of the block's <h5>
    print(div.find("p").get_text())   # text of the block's first <p>

how to extract links and handle the page as it is loading again and again using python beautifulsoup

I am trying to extract links and also want to handle the page loading again and again, but I cannot even get the links.
Code:
from bs4 import BeautifulSoup
import requests
r = requests.get('http://www.indiabusinessguide.in/business-categories/agriculture/agricultural-equipment.html')
soup = BeautifulSoup(r.text,'lxml')
links = soup.find_all('a',class_='link_orange')
for link in links:
    print(link['href'])
Please help me handle this loading and extract the links.

Try using the lxml library. The links come from an AJAX endpoint, so the response is obtained by sending a POST request to that URL with Requests.
import requests
from lxml import html
contact_list = []
def scrape(url, pages):
    # range() excludes its end value, so this requests pages 1 .. pages - 1
    for page in range(1, pages):
        headers = {
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36",
            "X-Requested-With": "XMLHttpRequest",
            "Cookie": "PHPSESSID=2q0tk3fi1kid0gbdfboh94ed56",
        }
        data = {
            "page": f"{page}"
        }
        r = requests.post(url, headers=headers, data=data)
        tree = html.fromstring(r.content)
        # XPath addresses attributes with "@", so the predicate is @class
        links = tree.xpath('//a[@class="link_orange"]')
        for link in links:
            contact_list.append(link.get('href'))
url = "http://www.indiabusinessguide.in/ajax_advertiselist.php"
scrape(url, 10)
print(contact_list)
print(len(contact_list))
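In XPath, attributes are addressed with "@", so the predicate for a class match is [@class="link_orange"]. Here is a small offline sketch of that predicate. Note the assumptions: it uses the standard library's xml.etree.ElementTree (which supports only a subset of XPath) rather than lxml.html, and the markup is an invented, well-formed fragment, not the site's real response.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for the AJAX response
snippet = """
<div>
  <a class="link_orange" href="/company/one.html">One</a>
  <a class="other" href="/skip.html">Skip</a>
  <a class="link_orange" href="/company/two.html">Two</a>
</div>
""".strip()

root = ET.fromstring(snippet)
# '@class' tests the attribute value; only the two matching anchors are returned
hrefs = [a.get("href") for a in root.findall(".//a[@class='link_orange']")]
print(hrefs)
```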

soup.select('.r a') in 'https://www.google.com/#q=vigilante+mic' gives empty list in python BeautifulSoup

I am using BeautifulSoup to extract all links from a Google search results page.
Here is the code snippet:
import requests,bs4
res = requests.get('https://www.google.com/#q=vigilante+mic')
soup = bs4.BeautifulSoup(res.text)
linkElem = soup.select('.r a')
But soup.select('.r a') returns an empty list.
Thanks

That's because of the URL you are using:
https://www.google.com/#q=vigilante+mic
This is the JavaScript version of the search. If you curl it, you will see there are no results in the HTML. The results are fetched through JavaScript, and requests doesn't execute JavaScript.
Try this other URL (which is not JavaScript-based):
https://www.google.com/search?q=vigilante+mic
Now it works:
import requests, bs4
res = requests.get('https://www.google.com/search?q=vigilante+mic')
# Name the parser explicitly so bs4 does not have to guess one
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElem = soup.select('.r a')
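The difference between the two URLs can be checked without any network call. Everything after "#" is a fragment, which stays on the client and is not part of the request sent to the server, while "?q=..." is a query string that the server does receive. Here is a short sketch with urlsplit from the standard library:

```python
from urllib.parse import urlsplit

fragment_url = urlsplit("https://www.google.com/#q=vigilante+mic")
query_url = urlsplit("https://www.google.com/search?q=vigilante+mic")

# The '#q=' form carries the search terms only in the fragment
print(repr(fragment_url.path), repr(fragment_url.query), repr(fragment_url.fragment))
# The '?q=' form carries them in the query string
print(repr(query_url.path), repr(query_url.query), repr(query_url.fragment))
```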

Besides changing #q= to ?q=, another reason the list is empty is that no user-agent is specified, so Google blocks your request. Check what your user-agent is.
Code and example that scrapes more:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'cyber security'}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')
# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
----------
'''
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://en.wikipedia.org/wiki/Computer_security
https://www.cisa.gov/cybersecurity
https://onlinedegrees.und.edu/blog/types-of-cyber-security-threats/
https://digitalguardian.com/blog/what-cyber-security
https://staysafeonline.org/
'''
Alternatively, you can achieve this with the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you only need to iterate over the returned JSON, without figuring out how to extract the data or finding CSS selectors that work.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "cyber security",
    "hl": "en",
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    link = result['link']
    print(link)
-----------------
'''
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://en.wikipedia.org/wiki/Computer_security
https://www.cisa.gov/cybersecurity
https://onlinedegrees.und.edu/blog/types-of-cyber-security-threats/
https://digitalguardian.com/blog/what-cyber-security
https://staysafeonline.org/
'''
P.S. I wrote a more detailed blog post about how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.
