Web-Scraping Amazon products - python-3.x

I want to scrape multiple Amazon product pages. If I print the title, for instance, it does not print the title for both links or ASINs, but only for the latter one. How can I print the title of both ASINs?
ASIN = ['B09C1Q9P1N','B096W87PPJ']
for a in ASIN:
    url = 'https://www.amazon.de/dp/' + a + '/'
    driver.get(url)
urls = 'https://www.amazon.de/dp/B09C1Q9P1N/','https://www.amazon.de/dp/B096W87PPJ/'
soupa = soup(driver.page_source, 'html.parser')
title = soupa.find(id='productTitle').text.replace('\n', '').replace(' ', '')

I am not sure I fully understand your question, but I gather you are using Selenium. Only the last title is printed because the page is parsed once, after the loop has finished. Have you tried loading and parsing each URL inside the for loop, such as:
import bs4
from selenium import webdriver
# Your webdriver code here
urls = ["https://www.amazon.de/dp/B09C1Q9P1N",
        "https://www.amazon.de/dp/B096W87PPJ"]
for url in urls:
    driver.get(url)  # load the page before reading its source
    html = driver.page_source  # page_source is already a string
    soup = bs4.BeautifulSoup(html, "html.parser")
    for title in soup.find_all("span", {"id": "productTitle"}):
        print(title.get_text(strip=True))
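To make the difference concrete, here is a minimal sketch of the per-page loop. It is not the original code: `FAKE_PAGES` and `get_page_source` are hypothetical stand-ins for `driver.get(url)` followed by `driver.page_source`, and the title is extracted with the standard library's `html.parser` instead of BeautifulSoup so the sketch runs without a browser or extra packages.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the element with id="productTitle"."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # > 0 while we are inside the target element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Simplification: void tags such as <br> inside the title are not handled.
        if self._depth:
            self._depth += 1
        elif dict(attrs).get("id") == "productTitle":
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.parts.append(data)


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    # Collapse newlines and repeated spaces but keep single spaces between words.
    return " ".join("".join(parser.parts).split())


# Hypothetical page sources, keyed by ASIN (made-up titles).
FAKE_PAGES = {
    "B09C1Q9P1N": '<span id="productTitle">\n  First product  \n</span>',
    "B096W87PPJ": '<span id="productTitle">\n  Second product \n</span>',
}


def get_page_source(asin):
    # Stand-in for driver.get(url); driver.page_source
    return FAKE_PAGES[asin]


titles = {}
for asin in ["B09C1Q9P1N", "B096W87PPJ"]:
    html = get_page_source(asin)         # fetch inside the loop
    titles[asin] = extract_title(html)   # parse inside the loop as well

print(titles)
```

Because both the fetch and the parse happen on every iteration, each ASIN gets its own entry instead of only the last page being parsed.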

Related

How To Refactor Web Scraping Code In Python

I am scraping data from the URL below and it works correctly, but I am looking for a more reliable and cleaner way to do it.
import pandas as pd
from bs4 import BeautifulSoup
import requests
pages = list(range(1, 548))
list_of_url = []
for page in pages:
    URL = "https://www.stats.gov.sa/ar/isic4?combine=&combine_1=All&items_per_page=5" + "&page=" + str(page)
    #print (URL)
    list_of_url.append(URL)
print(list_of_url)
list_activities = []
#page_number = 1
for url in list_of_url:
    URL = url
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find('div', class_='view-content')
    #print(results.prettify())
    try:
        activities = results.find_all("tr", class_=["views-row-first odd","even","odd","even","views-row-last odd"])
    except AttributeError:  # results is None when the page has no 'view-content' div
        print("in the activities line this is a bad url", URL)
        continue
    try:
        for activity in activities:
            activity_section = activity.find('td', class_='views-field views-field-field-chapter-desc-en-et').text.strip()
            activity_name = activity.find("td", class_="views-field views-field-field-activity-description-en-et").text.strip()
            activity_code = activity.find("td", class_="views-field views-field-field-activity-code active").text.strip()
            list_activities.append([activity_section,activity_name,activity_code])
    except AttributeError:  # a row is missing one of the expected cells
        print("url not found")
        continue
    #page_number += 1  # disabled: page_number is never initialised, so this would raise NameError
df = pd.DataFrame(list_activities, columns=["activity_section", "activity_name", "activity_code"])
df.head()
Here is a shorter version of your code:
import pandas as pd
from bs4 import BeautifulSoup
import requests
list_activities = []
URLS = [f'https://www.stats.gov.sa/ar/isic4?combine=&combine_1=All&items_per_page=5&page={page}' for page in range(1,3)]
for URL in URLS:
    page = requests.get(URL)
    soup = BeautifulSoup(page.text, "html.parser")
    results = soup.find('div', class_='view-content')
    activities = results.find_all("tr", class_=["views-row-first odd","even","odd","even","views-row-last odd"])
    list_activities += [[
        activity.find('td', class_='views-field views-field-field-chapter-desc-en-et').text.strip(),
        activity.find("td", class_="views-field views-field-field-activity-description-en-et").text.strip(),
        activity.find("td", class_="views-field views-field-field-activity-code active").text.strip()
    ] for activity in activities]
df = pd.DataFrame(list_activities, columns=["activity_section", "activity_name", "activity_code"])
df.head()
However, as an engineer at WebScrapingAPI, I would recommend implementing a stealthier scraper if you want to scrape this website in the long run. In my testing, it does not currently use any known bot-detection provider, but as a government website it might use a private detection system.
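On the "more reliable" part of the question, one common ingredient is retrying failed requests with a pause between attempts. The sketch below is only an illustration under stated assumptions: `fetch_with_retries` and `flaky_fetch` are hypothetical names, the fetch function is passed in as a parameter so the example runs without network access, and the retry count and delay are arbitrary.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) up to `retries` times (retries >= 1), pausing between attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # real code should catch specific network errors
            last_error = error
            if attempt < retries:
                sleep(delay)  # wait before the next attempt
    raise last_error


# Hypothetical fetcher that fails twice and then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


url = "https://www.stats.gov.sa/ar/isic4?combine=&combine_1=All&items_per_page=5&page=1"
html = fetch_with_retries(flaky_fetch, url, delay=0, sleep=lambda seconds: None)
print(html, len(calls))
```

In the scraper above, the fetch argument could be something like `lambda u: requests.get(u, timeout=10).text`, with a non-zero delay so the site is not hit in a tight loop.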

How to add new data to an array in Python?

What is wrong with my code?
I would like to add all URLs from the website to the array.
import requests
from bs4 import BeautifulSoup
url = 'https://globalthiel.pl'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    url.append(link.get('href').text)
for i in urls:
    print(i, end="")
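You have a typo in your code: you are appending to url (the string) instead of urls (the list). In addition, link.get('href') already returns a string (or None if the anchor has no href), so it has no .text attribute. Replace
url.append(link.get('href').text)
with
urls.append(link.get('href'))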
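Once the links are collected, two follow-up problems tend to appear: anchors without an `href` yield `None`, and many links are relative. A small sketch of one way to handle both with the standard library follows; `clean_links` and the sample values are hypothetical and only stand in for what `link.get('href')` might return.

```python
from urllib.parse import urljoin


def clean_links(base_url, hrefs):
    """Drop missing hrefs and turn relative paths into absolute URLs."""
    cleaned = []
    for href in hrefs:
        if not href:  # skips None and empty strings
            continue
        cleaned.append(urljoin(base_url, href))
    return cleaned


# Hypothetical values of the kind link.get('href') can return.
sample_hrefs = ["/kontakt", None, "https://example.com/page", ""]
urls = clean_links("https://globalthiel.pl", sample_hrefs)

for u in urls:
    print(u)
```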

I made a list of URLs of different pages to scrape. Is there any way to automate this process?

from bs4 import BeautifulSoup
import requests
URLs = ['https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=2&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=3&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=4&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=5&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=6&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=7&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=8&status=all&timeperiod=0']
for url in URLs:
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, 'lxml')
    restaurants = soup.find_all('div', class_ = 'categoryBusinessListWrapper___14CgD')
    for index, restaurant in enumerate(restaurants):
        tags = restaurant.find_all('a', class_ = 'internal___1jK0Z wrapper___26yB4')
        for tag in tags:
            restaurant_name = tag.find('div', class_ = 'businessTitle___152-c').text.split(',')[0]
            ratings = tag.find('div', class_ = 'textRating___3F1NO')
            location = tag.find('span', class_ = 'locationZipcodeAndCity___33EfU')
            more_info = tag['href']
As you can see, I created a URLs list to store the URLs of the different pages on this website. Is there a way to automate this? I use BeautifulSoup and the requests module for scraping, and I would like to generate the page URLs automatically instead of typing them out.
You can look at the pagination at the bottom of the page and use a list comprehension to create those links:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&status=all&timeperiod=0'
regex = re.compile('pagination')
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
pages = len(soup.find_all("a", {"class": regex}))
links = ['https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page={page}&status=all&timeperiod=0'.format(page=page+1) for page in range(0,pages) ]
Output:
print (links)
['https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=1&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=2&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=3&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=4&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=5&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=6&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=7&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=8&status=all&timeperiod=0']
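If the number of pages is already known, the same list can also be built from the query parameters rather than a hand-written format string. This is a sketch rather than the answer's method: `build_page_urls` is a hypothetical helper, and the page count of 8 is simply taken from the output above.

```python
from urllib.parse import urlencode

BASE = "https://www.trustpilot.com/categories/restaurants_bars"


def build_page_urls(last_page):
    """Return one listing URL per page, from page 1 to last_page inclusive."""
    urls = []
    for page in range(1, last_page + 1):
        params = {
            "numberofreviews": 0,
            "page": page,
            "status": "all",
            "timeperiod": 0,
        }
        # urlencode keeps the dictionary's insertion order and escapes values.
        urls.append(f"{BASE}?{urlencode(params)}")
    return urls


links = build_page_urls(8)
print(len(links))
print(links[0])
```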

BeautifulSoup python: Get the text with no tags and get the adjacent links

I am trying to extract the movie titles and their links from this site:
from bs4 import BeautifulSoup
from requests import get
link = "https://tamilrockerrs.ch"
r = get(link).content
#r = open('json.html','rb').read()
b = BeautifulSoup(r,'html5lib')
a = b.findAll('p')[1]
But the problem is that there is no tag for the titles. I can't extract the titles, and even if I could, how would I bind the links and titles together?
Thanks in advance.
You can find the title and link this way:
from bs4 import BeautifulSoup
import requests
url= "http://tamilrockerrs.ch"
response= requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
data = soup.find_all('div', {"class":"title"})
for film in data:
    print("Title:", film.find('a').text) # get the title here
    print("Link:", film.find('a').get("href")) # get the link here
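On the question of binding each title to its link: reading both values from the same anchor element keeps them paired, and storing them as one record per film avoids two parallel lists drifting apart. The sketch below illustrates that idea with the standard library's `html.parser` rather than BeautifulSoup; `AnchorCollector` and `SAMPLE_HTML` are hypothetical and do not reflect the real page's markup.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects a (text, href) pair for every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None  # href of the anchor currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Title and link come from the same element, so they stay paired.
            self.pairs.append(("".join(self._text).strip(), self._href))
            self._href = None


# Made-up markup for illustration only.
SAMPLE_HTML = """
<div class="title"><a href="/movie-one">Movie One</a></div>
<div class="title"><a href="/movie-two">Movie Two</a></div>
"""

collector = AnchorCollector()
collector.feed(SAMPLE_HTML)
collector.close()

films = [{"title": title, "link": href} for title, href in collector.pairs]
for film in films:
    print(film["title"], film["link"])
```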

How can I get the href of anchor tag using Beautiful Soup?

I am trying to get the href of the anchor tag of the very first video result on YouTube using Beautiful Soup. I am searching for it using "a" and class_="yt-simple-endpoint style-scope ytd-video-renderer".
But I am getting None output:
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.youtube.com/results?search_query=MP+election+results+2018%3A+BJP+minister+blames+conspiracy+as+reason+while+losing").text
soup = BeautifulSoup(source,'lxml')
# print(soup2.prettify())
a =soup.findAll("a", class_="yt-simple-endpoint style-scope ytd-video-renderer")
a_fin = soup.find("a", class_="compact-media-item-image")
#
print(a)
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.youtube.com/results?search_query=MP+election+results+2018%3A+BJP+minister+blames+conspiracy+as+reason+while+losing").text
soup = BeautifulSoup(source,'lxml')
first_search_result_link = soup.findAll('a',attrs={'class':'yt-uix-tile-link'})[0]['href']
Heavily inspired by this answer.
Another option is to render the page first with Selenium.
import bs4
from selenium import webdriver
url = 'https://www.youtube.com/results?search_query=MP+election+results+2018%3A+BJP+minister+blames+conspiracy+as+reason+while+losing'
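# Raw string so the backslashes in the Windows path are not treated as escapes.
# Newer Selenium releases take the driver path through a Service object instead.
browser = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
browser.get(url)
source = browser.page_source
soup = bs4.BeautifulSoup(source,'html.parser')
hrefs = soup.find_all("a", class_="yt-simple-endpoint style-scope ytd-video-renderer")
for a in hrefs:
    print (a['href'])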
Output:
/watch?v=Jor09n2IF44
/watch?v=ym14AyqJDTg
/watch?v=g-2V1XJL0kg
/watch?v=eeVYaDLC5ik
/watch?v=StI92Bic3UI
/watch?v=2W_4LIAhbdQ
/watch?v=PH1WZPT5IKw
/watch?v=Au2EH3GsM7k
/watch?v=q-j1HEnDn7w
/watch?v=Usjg7IuUhvU
/watch?v=YizmwHibomQ
/watch?v=i2q6Fm0E3VE
/watch?v=OXNAMyEvcH4
/watch?v=vdcBtAeZsCk
/watch?v=E4v2StDdYqs
/watch?v=x7kCuRB0f7E
/watch?v=KERtHNoZrF0
/watch?v=TenbA4wWIJA
/watch?v=Ey9HfjUyUvY
/watch?v=hqsuOT0URJU
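The hrefs printed above are relative paths. As a small follow-up sketch (not part of the original answer), the standard library can turn them into absolute URLs and pull out the video ID; `to_video` is a hypothetical helper, and the base URL is assumed to be the same host the search page was fetched from.

```python
from urllib.parse import parse_qs, urljoin, urlparse


def to_video(href, base="https://www.youtube.com"):
    """Return (video_id, absolute_url) for a relative /watch?v=... href."""
    query = parse_qs(urlparse(href).query)
    video_id = query.get("v", [None])[0]  # None when there is no v parameter
    return video_id, urljoin(base, href)


# The first two hrefs from the output above.
hrefs = ["/watch?v=Jor09n2IF44", "/watch?v=ym14AyqJDTg"]
videos = [to_video(href) for href in hrefs]

for video_id, url in videos:
    print(video_id, url)
```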
The HTML is generated dynamically. You can use Selenium, or request the static HTML by sending a Googlebot user-agent:
headers = {'User-Agent' : 'Googlebot/2.1 (+http://www.google.com/bot.html)'}
source = requests.get("https://.......", headers=headers).text
soup = BeautifulSoup(source, 'lxml')
links = soup.findAll("a", class_="yt-uix-tile-link")
for link in links:
    print(link['href'])
Try looping over the matches (updated here to Python 3 syntax):
from urllib.request import urlopen
from bs4 import BeautifulSoup
data = urlopen("some_url")
html_data = data.read()
soup = BeautifulSoup(html_data, "html.parser")
for a in soup.find_all('a', href=True):
    print(a['href'])
The class you are searching for does not exist in the scraped HTML. You can verify this by printing the soup variable.
For example:
a =soup.findAll("a", class_="sign-in-link")
gives output as:
[<a class="sign-in-link" href="https://accounts.google.com/ServiceLogin?passive=true&continue=https%3A%2F%2Fwww.youtube.com%2Fsignin%3Faction_handle_signin%3Dtrue%26app%3Ddesktop%26feature%3Dplaylist%26hl%3Den%26next%3D%252Fresults%253Fsearch_query%253DMP%252Belection%252Bresults%252B2018%25253A%252BBJP%252Bminister%252Bblames%252Bconspiracy%252Bas%252Breason%252Bwhile%252Blosing&uilel=3&hl=en&service=youtube">Sign in</a>]
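Instead of reading the whole printed soup, it can be quicker to list which classes actually occur on the anchors in the fetched HTML. The following is a rough sketch of that check using only the standard library; `ClassCounter` and `SAMPLE` are hypothetical, and the sample markup is invented to resemble the kind of static HTML described above.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts the class names that appear on a given tag."""

    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted_tag:
            # The class attribute may be missing, so fall back to "".
            for name in (dict(attrs).get("class") or "").split():
                self.counts[name] += 1


# Invented markup for illustration only.
SAMPLE = (
    '<a class="sign-in-link" href="#">Sign in</a>'
    '<a class="yt-uix-tile-link big" href="/watch?v=x">Video</a>'
)

counter = ClassCounter("a")
counter.feed(SAMPLE)
counter.close()

print(sorted(counter.counts))
print("ytd-video-renderer" in counter.counts)
```

If the class being searched for is missing from the counts, the selector cannot match, which points to the page being rendered by JavaScript or served in a different layout.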

Resources