Python returning a list

I'm using bs4 and have iterated through all of the links I need on a single page. I then stored those links in a list.
Here's my code:
def scrape1(self):
    html = self.browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    # add links to list for later use
    urls = []
    for videos in soup.find_all('a', {'class': 'watch-now'}):
        links = videos['href']
        urls.append(links)
    return urls

def use(self):
    urls = scrape1()
I thought that by using return I could use the urls in a different method. I want to be able to use every link I appended to the urls list; is there a better way to do this when using classes that I'm not understanding?

Since these are instance methods, you should be using self to call them:
def use(self):
    urls = self.scrape1()
And you don't have to return from the scrape1() method; you can set an instance attribute instead, e.g.:
class MyScraper():
    # ...

    def scrape1(self):
        html = self.browser.page_source
        soup = BeautifulSoup(html, 'html.parser')
        self.urls = [a['href'] for a in soup.select('a.watch-now')]

    def use(self):
        self.scrape1()
        # use self.urls
        print(self.urls)
And you will be able to use the urls from outside the class as well:
scraper = MyScraper()
scraper.scrape1()
print(scraper.urls)

You could just have the method store the urls in an attribute of the class:
self.urls = urls
Then you could reference that from other methods.
Anything assigned to self. is an attribute that you can reference across the class, so you could write another method that uses self.urls without needing to pass the list in as a parameter.
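For example, a minimal sketch of that pattern (the class name and hard-coded links are placeholders, not from the question):

class VideoScraper:
    def scrape1(self):
        # store the result on the instance instead of returning it
        self.urls = ['https://example.com/video1', 'https://example.com/video2']

    def use(self):
        # any other method can read the attribute set by scrape1()
        for url in self.urls:
            print(url)

scraper = VideoScraper()
scraper.scrape1()
scraper.use()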

Related

Beautiful Soup Value not extracting properly

Recently I was working with Python and Beautiful Soup to extract some data and put it into a pandas DataFrame.
I used Beautiful Soup to extract some of the hotel data from the website booking.com.
I was able to extract most of the attributes correctly, without any empty values.
Here is my code snippet:
def get_Hotel_Facilities(soup):
    try:
        title = soup.find_all("div", attrs={"class":"db29ecfbe2 c21a2f2d97 fe87d598e8"})
        new_list = []
        # Inner NavigableString Object
        for i in range(len(title)):
            new_list.append(title[i].text.strip())
    except AttributeError:
        new_list = ""
    return new_list
The above code is my function to retrieve a hotel's facilities and return them as a list.
page_no = 0
d = {"Hotel_Name":[], "Hotel_Rating":[], "Room_type":[], "Room_price":[], "Room_sqft":[], "Facilities":[], "Location":[]}

while page_no <= 25:
    URL = f"https://www.booking.com/searchresults.html?aid=304142&label=gen173rf-1FCAEoggI46AdIM1gDaGyIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGiAg1wcm9qZWN0cHJvLmlvqAIDuAKwwPadBsACAdICJDU0NThkNDAzLTM1OTMtNDRmOC1iZWQ0LTdhOTNjOTJmOWJlONgCBeACAQ&sid=2214b1422694e7b065e28995af4e22d9&sb=1&sb_lp=1&src=theme_landing_index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Fhotel%2Findex.html%3Faid%3D304142%26label%3Dgen173rf1FCAEoggI46AdIM1gDaGyIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGiAg1wcm9qZWN0cHJvLmlvqAIDuAKwwPadBsACAdICJDU0NThkNDAzLTM1OTMtNDRmOC1iZWQ0LTdhOTNjOTJmOWJlONgCBeACAQ%26sid%3D2214b1422694e7b065e28995af4e22d9%26&ss=goa&is_ski_area=0&checkin_year=2023&checkin_month=1&checkin_monthday=13&checkout_year=2023&checkout_month=1&checkout_monthday=14&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1&offset{page_no}"
    new_webpage = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(new_webpage.content, "html.parser")
    links = soup.find_all("a", attrs={"class":"e13098a59f"})

    for link in links:
        new_webpage = requests.get(link.get('href'), headers=HEADERS)
        new_soup = BeautifulSoup(new_webpage.content, "html.parser")
        d["Hotel_Name"].append(get_Hotel_Name(new_soup))
        d["Hotel_Rating"].append(get_Hotel_Rating(new_soup))
        d["Room_type"].append(get_Room_type(new_soup))
        d["Room_price"].append(get_Price(new_soup))
        d["Room_sqft"].append(get_Room_Sqft(new_soup))
        d["Facilities"].append(get_Hotel_Facilities(new_soup))
        d["Location"].append(get_Hotel_Location(new_soup))

    page_no += 25
The above code is the main part: the while loop traverses the paginated result pages and retrieves the URLs of the hotel pages. After retrieving them, it visits every page to extract the corresponding attributes.
I was able to retrieve the rest of the attributes correctly, but I am not able to retrieve the facilities: only some of the room facilities are being returned and some are not.
Here is my output after loading it into a pandas DataFrame:
[Facilities output image]
Please help me with this problem: why are some facilities coming through while others are not?
P.S.: The facilities are available on the website.
I have tried using all the corresponding classes and attributes for retrieval, but I am not getting the Facilities column properly.
Probably as a protective measure, the HTML fetched by requests doesn't seem to be consistent in its layout or even its contents.
There might be more possible selectors, but try:
def get_Hotel_Facilities(soup):
    selectors = ['div[data-testid="property-highlights"]', '#facilities',
                 '.hp-description~div div.important_facility']
    new_list = []
    for sel in selectors:
        for sect in soup.select(sel):
            new_list += list(sect.stripped_strings)
    return list(set(new_list))  # set <--> unique
But even with this, the results are inconsistent. E.g.: I tested on this page with
for i in range(10):
    soup = BeautifulSoup(cloudscraper.create_scraper().get(url).content)
    fl = get_Hotel_Facilities(soup) if soup else []
    print(f'[{i}] {len(fl)} facilities: {", ".join(fl)}')
(But the inconsistencies might be due to using cloudscraper - maybe you'll get better results with your headers?)

How can I reduce the number of requests and use only one?

My program does this:
1. Get the XML from my website
2. Run through all the URLs
3. Get data from my web page (SKU, name, title, price, etc.) with requests
4. Get the lowest price from another website, by comparing the price for the same SKU, also with requests
I'm using lots of requests, one in each def:
def get_Price(SKU):
    check = 'https://www.XXX=' + SKU
    r = requests.get(check)
    html = requests.get(r.url)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    return Price

def get_StoreName(SKU):
    check = 'https://XXX?keyword=' + SKU
    r = requests.get(check)
    html = requests.get(r.url)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    return storeName

def get_h1Tag(u):
    html = requests.get(u)
    bsObj = BeautifulSoup(html.content, 'xml')
    h1 = bsObj.find('h1', attrs={'itemprop':'name'}).get_text()
    return h1
How can I reduce the number of requests or connections to the URL, and use one request or one connection throughout the whole program?
I assume this is a script with a group of methods you call in a particular order.
If so, this is a good use case for a dict. I would write a function that memoizes calls to URLs.
You can then reuse this function across your other functions:
requests_cache = {}

def get_url(url, format_parser):
    if url not in requests_cache:
        r = requests.get(url)
        html = requests.get(r.url)
        requests_cache[url] = BeautifulSoup(html.content, format_parser)
    return requests_cache[url]
def get_Price(makat):
    url = 'https://www.zap.co.il/search.aspx?keyword=' + makat
    bsObj = get_url(url, 'html.parser')
    # your code to find the price
    return zapPrice

def get_zapStoreName(makat):
    url = 'https://www.zap.co.il/search.aspx?keyword=' + makat
    bsObj = get_url(url, 'html.parser')
    # your code to find the store name
    return storeName

def get_h1Tag(u):
    bsObj = get_url(u, 'xml')
    h1 = bsObj.find('h1', attrs={'itemprop':'name'}).get_text()
    return h1
If you want to avoid a global variable, you can also set requests_cache as an attribute of get_url or as a default argument in the definition; a sketch of the latter follows. The default-argument approach would also let you bypass the cache by passing in an empty dict.
Again, the assumption here is that you are running this code as a script periodically. In that case, requests_cache will be cleared every time you run the program.
However, if this is part of a larger program, you would want to expire the cache on a regular basis, otherwise you would get the same results every time.
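A minimal sketch of the default-argument variant (example.com is a placeholder); the mutable default dict persists between calls, which is exactly what makes it work as a cache:

def get_url(url, format_parser, requests_cache={}):  # shared default dict acts as the cache
    if url not in requests_cache:
        r = requests.get(url)
        requests_cache[url] = BeautifulSoup(r.content, format_parser)
    return requests_cache[url]

first = get_url('https://example.com', 'html.parser')   # fetched over the network
second = get_url('https://example.com', 'html.parser')  # served from the cache

# pass a fresh dict to bypass the shared cache for a single call
fresh = get_url('https://example.com', 'html.parser', requests_cache={})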
This is a good use case for the requests-cache library. Example:
from requests_cache import CachedSession
# Save cached responses in a SQLite file (scraper_cache.sqlite), and expire after 6 minutes
session = CachedSession('scraper_cache.sqlite', expire_after=360)
def get_Price(SKU):
    check = 'https://www.XXX=' + SKU
    r = session.get(check)
    html = session.get(r.url)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    return Price

def get_StoreName(SKU):
    check = 'https://XXX?keyword=' + SKU
    r = session.get(check)
    html = session.get(r.url)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    return storeName

def get_h1Tag(u):
    html = session.get(u)
    bsObj = BeautifulSoup(html.content, 'xml')
    h1 = bsObj.find('h1', attrs={'itemprop':'name'}).get_text()
    return h1
Aside: with or without requests-cache, using sessions is good practice whenever you're making repeated calls to the same host, since it uses connection pooling: https://docs.python-requests.org/en/latest/user/advanced/#session-objects
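For example, a minimal sketch of plain requests.Session usage (the URL and SKUs are placeholders):

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # reuses TCP connections to the same host across calls

for sku in ['12345', '67890']:  # hypothetical SKUs
    r = session.get('https://example.com/search', params={'keyword': sku})
    bsObj = BeautifulSoup(r.content, 'html.parser')
    # ... parse the price / store name out of bsObj ...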

Scraping with Python 3

Python3:
I'm new to scraping, and to practice I'm trying to get all the functions from this page:
https://www.w3schools.com/python/python_ref_functions.asp
from bs4 import BeautifulSoup
import requests
url = "https://www.w3schools.com/python/python_ref_functions.asp"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
print(soup.td.text)
# Output: abs()
No matter what I try, I only get the first one: abs().
Can you help me get them all, from abs() to zip()?
To get all similar tags from a webpage, use find_all(); it returns a list of items.
To get a single tag, use find(); it returns a single item.
The trick is to get the parent tag of all the elements you need, then use whichever methods are most convenient. You can find more in the BeautifulSoup documentation.
from bs4 import BeautifulSoup
import requests

url = "https://www.w3schools.com/python/python_ref_functions.asp"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')

# scrape the table which contains all the functions
tabledata = soup.find("table", attrs={"class": "w3-table-all notranslate"})
# print(tabledata)

# from the table, get all <a> tags of the functions
functions = tabledata.find_all("a")

# find_all() returns a list of elements; iterate over it
for func in functions:
    print(func.contents)
You can use find_all to iterate over every tag that matches:
for tag in soup.find_all('td'):
    print(tag.text)
This will include the Description column though, so you'll need to change this to ignore those cells.
soup.td will only return the first matching tag.
So one solution would be:
for tag in soup.find_all('tr'):
    cell = tag.td
    if cell:
        print(cell.text)
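Another possible sketch, assuming each function name sits in the first cell of its table row, is a CSS selector that picks only the first td of each row:

# assumes each row's first <td> holds the function name; the <th> header row is skipped
for cell in soup.select('tr > td:first-child'):
    print(cell.text.strip())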

Prepending a string to the output of a BeautifulSoup scrape

I am scraping a forum page for posts and relevant links using BeautifulSoup.
The links I want on the page are of the form r"xx/res/[0-9]{5}.html$".
So far, so good finding them in my BeautifulSoup object; the following link format is returned when I print: /xx/res/83071.html.
I now want to prepend the domain name 'http://website.com' to each result, and use the full URL as the basis for further scraping.
My successful code looks like this:
url = 'http://website.com/xx/index.html'
res = urlopen(url)
soup = BeautifulSoup(res, 'html.parser')
links = soup.select('a',{'href':re.compile(r"xx/res/[0-9]{5}.html$")})
for l in links:
    print(l['href'])
As an example, the following is printed to the console:
/xx/res/83071.html
/xx/res/81813.html
/xx/res/92014.html
/xx/res/92393.html
I'm hoping to get some help with the correct syntax for prepending the string to each result.
Thanks.
This will work for you:
url = 'http://website.com/xx/index.html'
res = urlopen(url)
soup = BeautifulSoup(res, 'html.parser')
links = soup.select('a',{'href':re.compile(r"xx/res/[0-9]{5}.html$")})
for l in links:
    print('http://website.com' + l['href'])
There are several ways to do it. I personally like the str.format method.
Store the base URL:
xx = 'http://website.com'
Your print line would then be:
print('{}{}'.format(xx, l['href']))
where each {} gets replaced by .format with the variable you feed into the corresponding parameter.
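Another option is urllib.parse.urljoin from the standard library, which resolves a relative href against a base URL and handles the leading slashes for you; a minimal sketch:

from urllib.parse import urljoin

base = 'http://website.com'
for l in links:
    # urljoin joins the base URL and the relative path correctly
    print(urljoin(base, l['href']))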

InvalidSchema: No connection adapters were found python3.5.2

I'm trying to extract emails from web pages, here is my email grabber function:
def emlgrb(x):
    email_set = set()
    for url in x:
        try:
            response = requests.get(url)
            soup = bs.BeautifulSoup(response.text, "lxml")
            emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", soup.text, re.I))
            email_set.update(emails)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            continue
    return email_set
This function is fed by another function that creates a list of URLs. The feeder function:
def handle_local_links(url, link):
    if link.startswith("/"):
        return "".join([url, link])
    return link

def get_links(url):
    try:
        response = requests.get(url, timeout=5)
        soup = bs.BeautifulSoup(response.text, "lxml")
        body = soup.body
        links = [link.get("href") for link in body.find_all("a")]
        links = [handle_local_links(url, link) for link in links]
        links = [str(link.encode("ascii")) for link in links]
        return links
    # ... several except clauses follow that just return an empty list (omitted here) ...
It continues with many except clauses which, if raised, return an empty list (not important here). However, the return value from get_links() looks like this:
["b'https://pythonprogramming.net/parsememcparseface//'"]
Of course there are many links in the list (I cannot post them all - reputation). The emlgrb() function is not able to process that list (InvalidSchema: No connection adapters were found). However, if I manually remove the b and the redundant quotes, so the list looks like this:
['https://pythonprogramming.net/parsememcparseface//']
then emlgrb() works. Any suggestions on where the problem is, or on how to create a "cleaning function" to turn the first list into the second, are welcome.
Thanks
The solution is to drop .encode('ascii')
def get_links(url):
    try:
        response = requests.get(url, timeout=5)
        soup = bs.BeautifulSoup(response.text, "lxml")
        body = soup.body
        links = [link.get("href") for link in body.find_all("a")]
        links = [handle_local_links(url, link) for link in links]
        links = [str(link) for link in links]
        return links
You can pass an encoding to str(), as in the docs: str(object=b'', encoding='utf-8', errors='strict').
The problem is that str() without an encoding calls .__repr__() (or .__str__()) on the object, so if it is a bytes object the output is "b'string'". That is also what gets printed when you do print(bytes_obj). And calling .encode() on a str object creates a bytes object!
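A quick illustration of the difference, using the URL from the question:

link = "https://pythonprogramming.net/parsememcparseface//"
b = link.encode("ascii")          # bytes object

print(str(b))                     # b'https://pythonprogramming.net/parsememcparseface//'
print(str(b, encoding="ascii"))   # https://pythonprogramming.net/parsememcparseface//
print(b.decode("ascii"))          # same result as the line above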
