Scraping a list with Python and BeautifulSoup - python-3.x

I am new to Python and trying to write some code that scrapes information from a website. I currently have:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
for i in range(1, 300):
    url = "[REMOVED]/footwear?page=%s" % i
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.find_all('div', 'product-block__info')
    for item in items:
        for val in item.find_all('a','product-block'):
            stock = item.find_all('class','count_product_stock hidden')[0].text
            brand = item.find_all('div','brand')[0].text
            price = item.find_all('span','selling_price')[0].text
            print (items)
Which returns the error IndexError: list index out of range. If I put 'product-block__info' in the place of 'product-block' then I am able to print off the full list of the content within the 'product-block__info' tag on the page, but I'd like to just select a handful of elements and return these.
Can anyone explain to me what's happening here and how I can select just the elements I want from inside 'product-block__info'?

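The first positional argument to find_all is the tag name, so item.find_all('class', ...) looks for <class> tags, finds none, and indexing the empty result with [0] raises IndexError. When selecting by attribute, pass the tag name first and then use either the attrs dictionary or a keyword argument such as class_. Since you only want the first match, find returns it directly and the [0] is not needed: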
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
for i in range(1, 300):
    url = "[REMOVED]/footwear?page=%s" % i
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    # one product-block__info div per product
    items = soup.find_all('div', class_='product-block__info')
    for item in items:
        stock = item.find('span', class_='count_product_stock hidden').text
        brand = item.find('h4', class_='brand').text
        price = item.find('span', class_='selling_price').text
        print(stock, brand, price)
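
If some products lack one of the fields, item.find returns None and .text raises AttributeError. The following is a minimal sketch of a more defensive version, run against inline sample markup rather than the real site. The markup, the helper names text_or_none and extract_products, and the sample values are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one page of the real site.
SAMPLE_HTML = """
<div class="product-block__info">
  <h4 class="brand">Acme</h4>
  <span class="selling_price">49.99</span>
  <span class="count_product_stock hidden">3</span>
</div>
<div class="product-block__info">
  <h4 class="brand">Globex</h4>
  <span class="selling_price">19.50</span>
</div>
"""


def text_or_none(parent, name, css_class):
    """Return the stripped text of the first matching child, or None if absent."""
    tag = parent.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


def extract_products(html):
    """Collect brand, price and stock for every product block in the page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for item in soup.find_all("div", class_="product-block__info"):
        rows.append({
            "brand": text_or_none(item, "h4", "brand"),
            "price": text_or_none(item, "span", "selling_price"),
            "stock": text_or_none(item, "span", "count_product_stock hidden"),
        })
    return rows


products = extract_products(SAMPLE_HTML)
print(products)
```

The second sample product has no stock element, so its stock value comes back as None instead of stopping the loop.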

Related

How to print item in a list without a for loop

I'm trying to get just the price from a website and found that the span with class="udYkAW2UrhZln2Iv62EYb" holds the price on one line. But when I try to print it out I keep getting
<span class="udYkAW2UrhZln2Iv62EYb">$0.312423</span>
and not just the price itself. I fixed this by using a for loop to get the item, but is there a way to display just the price with a print call, without a for loop?
Please and thank you.
Here's the code
from bs4 import BeautifulSoup as bs
import requests
url = 'https://robinhood.com/crypto/DOGE'
r = requests.get(url)
# make the soup
soup = bs(r.content, 'lxml')
# where the price was found: "span class='udYkAW2UrhZln2Iv62EYb'"
# using find() because this is the first instance of this class
price_class = soup.find('span', {'class' : 'udYkAW2UrhZln2Iv62EYb'})
print(price_class)
type(price_class)
# output: <span class="udYkAW2UrhZln2Iv62EYb">$0.312423</span>
# output: bs4.element.Tag
for i in price_class:
    print(i)
# output: $0.312423
Use .text or .get_text():
from bs4 import BeautifulSoup as bs
import requests
url = "https://robinhood.com/crypto/DOGE"
r = requests.get(url)
soup = bs(r.content, "lxml")
price = soup.find("span", {"class": "udYkAW2UrhZln2Iv62EYb"})
print(price.text) # <--- use .text
Prints:
$0.315917
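
If the price is needed as a number, or the page might not contain the span at all (find then returns None), a small guard helps. This is only a sketch against an inline HTML string. The sample markup and the variable names price_text and price_value are assumptions, and the real page may render its price differently.

```python
from bs4 import BeautifulSoup

# Inline sample standing in for the downloaded page (assumed markup).
html = '<span class="udYkAW2UrhZln2Iv62EYb"> $0.312423 </span>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("span", class_="udYkAW2UrhZln2Iv62EYb")
if tag is None:
    # find() returns None when nothing matches, e.g. content loaded by JavaScript
    price_text = None
    price_value = None
else:
    # get_text(strip=True) removes surrounding whitespace from the text
    price_text = tag.get_text(strip=True)
    # drop the leading currency symbol before converting to a float
    price_value = float(price_text.lstrip("$"))

print(price_text, price_value)
```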

Why is my company_url variable not being defined?

I am trying to scrape the rank, name and url of a company from a website. This involves two pages and I have nested functions to get all the information I need. However, when I try to print the details I get an error that the company_url variable is not defined. I thought that calling the company_button_url function within the main function would do the job, but something is wrong. I have tried calling company_button_url() at differing points in the code, but cannot get it to work.
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
# Handle 403- Forbidden Error
url = 'https://www.b.co.uk/the-lists/mid-companies/'
req = Request(url, headers={'User-Agent': 'Mozilla'})
html = urlopen(req).read()
html_page = html.decode('utf-8')
soup = BeautifulSoup(html_page, 'html.parser') # create soup object
'''Main Function'''
def company_details():
    # find rank
    rank = soup.find('div', class_="company-score-redesign").text
    # find company name
    company_name = soup.find('div', class_="company-name-redesign").text
    # find company website
    ''' Find Button Url...Parse HTML from new Url...Find Company Website '''
    def company_button_url():
        comp = soup.find('div', class_="company-name-redesign-mobile")
        comp_btn = comp.find('a', href = True)
        comp_btn_url = comp_btn['href']
        new_url = comp_btn_url
        # Handle 403- Forbidden Error
        new_req = Request(new_url, headers={'User-Agent': 'Mozilla'})
        nhtml = urlopen(new_req).read() # Getting new page
        nhtml_page = nhtml.decode('utf-8')
        nsoup = BeautifulSoup(nhtml_page, 'html.parser') # create new soup object
        div_company_url = nsoup.find('div', class_="profile-info")
        href_company_url = div_company_url.find('a', href = True)
        company_url = href_company_url['href']
        return company_url
    company_button_url()
    print(rank, company_name, company_url)
    return()
company_details()
Feel very free to pull my coding to pieces - I am very new to this!
Thanks in advance.
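
A likely cause, going by the code as posted: company_url is a local variable of company_button_url, so it only exists inside that function. Calling company_button_url() returns the value, but the result is discarded unless it is assigned to a name in the outer function. The sketch below shows the pattern with the scraping replaced by placeholder strings. The placeholder values are assumptions, not data from the site.

```python
def company_details():
    # Placeholders standing in for the values scraped from the first page.
    rank = "1"
    company_name = "Example Ltd"

    def company_button_url():
        # This name is local to the inner function only.
        company_url = "https://example.com"
        return company_url

    # Capture the return value; without this assignment the outer
    # function has no variable called company_url.
    company_url = company_button_url()
    return rank, company_name, company_url


details = company_details()
print(*details)
```

Returning the tuple instead of printing inside the function also makes the result reusable, for example when looping over several companies.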

I made a list of URLs of different pages to scrape. Is there any way to automate this process?

from bs4 import BeautifulSoup
import requests
urls = ['https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=2&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=3&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=4&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=5&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=6&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=7&status=all&timeperiod=0',
        'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=8&status=all&timeperiod=0']
for url in urls:
    html_text = requests.get(url).text
    soup = BeautifulSoup(html_text, 'lxml')
    restaurants = soup.find_all('div', class_ = 'categoryBusinessListWrapper___14CgD')
    for index, restaurant in enumerate(restaurants):
        tags = restaurant.find_all('a', class_ = 'internal___1jK0Z wrapper___26yB4')
        for tag in tags:
            restaurant_name = tag.find('div', class_ = 'businessTitle___152-c').text.split(',')[0]
            ratings = tag.find('div', class_ = 'textRating___3F1NO')
            location = tag.find('span', class_ = 'locationZipcodeAndCity___33EfU')
            more_info = tag['href']
As you can see, I created a urls list to store the URLs of the different pages on this website. Is there a way to automate this? I use BeautifulSoup and the requests module for scraping, and I want to know whether the URLs for the different pages can be generated automatically.
You can look at the pagination at the bottom of the page and use a list comprehension to create those links:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&status=all&timeperiod=0'
regex = re.compile('pagination')
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
pages = len(soup.find_all("a", {"class": regex}))
links = ['https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page={page}&status=all&timeperiod=0'.format(page=page+1) for page in range(0,pages) ]
Output:
print (links)
['https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=1&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=2&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=3&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=4&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=5&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=6&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=7&status=all&timeperiod=0', 'https://www.trustpilot.com/categories/restaurants_bars?numberofreviews=0&page=8&status=all&timeperiod=0']
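
Once the number of pages is known, the query string can also be built from a dictionary with the standard library instead of being formatted by hand. This is a small sketch. build_page_urls and last_page are hypothetical names, and the parameter set is copied from the URLs in the question.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.trustpilot.com/categories/restaurants_bars"


def build_page_urls(last_page):
    """Return the listing URLs for pages 1..last_page."""
    urls = []
    for page in range(1, last_page + 1):
        # Parameters taken from the URLs in the question.
        params = {
            "numberofreviews": 0,
            "page": page,
            "status": "all",
            "timeperiod": 0,
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


links = build_page_urls(3)
for link in links:
    print(link)
```

With requests, the same dictionary can instead be passed as requests.get(BASE_URL, params=params), which encodes it in the same way.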

Scraping all href links using Pagination

I have to select each state from https://www.maxpreps.com/search/states_by_sport.aspx?gendersport=boys,football&season=fall, click on team rankings, and then grab the href links of each ranked team.
I've completed everything up to the team rankings part. Now I want to get the links of each ranked team from all the pages in the pagination bar. Right now I'm only getting the links of the teams on the first page, and I don't know how to navigate to the next page. My code is below:
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
site = "https://www.maxpreps.com"
url = requests.get("https://www.maxpreps.com/search/states_by_sport.aspx?gendersport=boys,football&season=fall")
soup = BeautifulSoup(url.content, "html.parser")
states = soup.findAll('div', {'class': 'states'})
for each_state in states:
    all_states = each_state.find_all('a', href=True)
    for a in all_states:
        domain = site + a['href'] # domain consists of links of states
        for r in domain:
            page_link = domain
            page_response = requests.get(page_link)
            soup = BeautifulSoup(page_response.content, "html.parser")
            for link in soup.findAll('a', attrs={'href': re.compile("rankings")}):
                rankings_link = site + link.get('href')
                #print(rankings_link)
                for ert in rankings_link:
                    team_link = rankings_link
                    page_response1 = requests.get(team_link)
                    soup = BeautifulSoup(page_response1.content, "html.parser")
                    My_table = soup.find('table',{'class':'mx-grid sortable rankings-grid'})
                    links = My_table.findAll('a')
                    print(links)
output:
Everett, Methuen,
You could just iterate through pages within the query parameters.
import requests
from bs4 import BeautifulSoup
site = "https://www.maxpreps.com"
session = requests.Session()
response = session.get("https://www.maxpreps.com/search/states_by_sport.aspx?gendersport=boys,football&season=fall")
soup = BeautifulSoup(response.content, "html.parser")
all_states = soup.find('div', {'class': 'states'})
states_list = []
for each in all_states.find_all('a'):
    # the state code is the last query parameter of each link
    states_list.append(each['href'].split('=')[-1])
states_list = states_list[:-1]
team_links = []
url = 'https://www.maxpreps.com/m/rankings/list.aspx'
for state in states_list:
    break_loop = False
    page = 1
    while not break_loop:
        print ('%s: Page %s' %(state, page))
        payload = {
            'page': str(page),
            'ssid': '8d610ab9-220b-465b-9cf0-9f417bce6c65',
            'state': state
        }
        response = session.get(url, params=payload)
        soup = BeautifulSoup(response.text, "html.parser")
        table = soup.find('table')
        if table is None:
            # no table means we ran past the last page for this state
            break_loop = True
        else:
            page += 1
            links = table.find_all('a')
            for link in links:
                team_links.append('https://www.maxpreps.com' + link['href'])
Output:
print (team_links[:10])
['https://www.maxpreps.com/m/high-schools/central-red-devils-(phenix-city,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/thompson-warriors-(alabaster,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/hoover-buccaneers-(hoover,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/oxford-yellow-jackets-(oxford,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/mountain-brook-spartans-(birmingham,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/hewitt-trussville-huskies-(trussville,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/mcgill-toolen-yellowjackets-(mobile,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/lee-generals-(montgomery,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/pinson-valley-indians-(pinson,al)/football/default.htm', 'https://www.maxpreps.com/m/high-schools/vestavia-hills-rebels-(vestavia-hills,al)/football/default.htm']
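
The loop above follows a general pattern: request page 1, 2, 3 and so on until a page comes back with no results. Below is a minimal sketch of that pattern with the network call swapped for a stand-in function, so it runs without touching the site. collect_all_pages, fake_fetch and FAKE_SITE are hypothetical names, and max_pages is an added safety cap that the answer's code does not have.

```python
def collect_all_pages(fetch_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... until it returns an empty list."""
    results = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            # An empty page signals that the previous page was the last one.
            break
        results.extend(items)
    return results


# Stand-in for the real request-and-parse step (made-up data).
FAKE_SITE = {1: ["/team-a", "/team-b"], 2: ["/team-c"]}


def fake_fetch(page):
    return FAKE_SITE.get(page, [])


team_links = collect_all_pages(fake_fetch)
print(team_links)
```

In a real scraper, fetch_page would perform the request for one page and return the hrefs found in its table, or an empty list when no table is present.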

How to only retrieve the tag I specify using BeautifulSoup

I just want the written text out of this website: https://algorithms-tour.stitchfix.com/ so I can put it in a Word doc and read it.
When I run the code, I get all the HTML and the tags. What I want is in there at the very end, but I only want the text, separated from the markup.
import requests
from bs4 import BeautifulSoup
url = "https://algorithms-tour.stitchfix.com"
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, "html.parser")
item = soup.find_all("p")
print(item)
Is there a way to get just content so I can clean it up some more?
You have a few options for this. If you only want text found within p tags, you can do this:
import requests
from bs4 import BeautifulSoup
url = "https://algorithms-tour.stitchfix.com"
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("p")
result = []
for item in items:
    result.append(item.string)
print(result)
Note that soup.find_all returns an iterable list, and not a single object.
An alternative, and easier method is to just use soup.get_text:
import requests
from bs4 import BeautifulSoup
url = "https://algorithms-tour.stitchfix.com"
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())
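
One caveat with the item.string approach: .string is None for a tag that has more than one child, such as a paragraph containing bold text or a link. Calling get_text() on each paragraph avoids that while still limiting the output to p tags. The sketch below uses a made-up two-paragraph HTML string rather than the real page.

```python
from bs4 import BeautifulSoup

# Made-up sample: the second paragraph contains a nested tag.
html = "<p>Plain paragraph.</p><p>Has <b>bold</b> text.</p>"
soup = BeautifulSoup(html, "html.parser")
paragraphs = soup.find_all("p")

# .string only works when the tag has exactly one child
strings = [p.string for p in paragraphs]

# get_text() joins all nested strings; " " is the separator between pieces
texts = [p.get_text(" ", strip=True) for p in paragraphs]

print(strings)
print(texts)
```

Joining texts with "\n\n".join(texts) then gives plain text that can be pasted into a document.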

Resources