My program does this:
Get the XML from my website
Loop through all the URLs
Get data from each page (SKU, name, title, price, etc.) with requests
Get the lowest price from another website by comparing prices for the same SKU, again with requests.
I'm making lots of requests, one inside each def:
import requests
from bs4 import BeautifulSoup

def get_Price(SKU):
    check = 'https://www.XXX=' + SKU
    r = requests.get(check)
    html = requests.get(r.url)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    # ... parse Price out of bsObj ...
    return Price

def get_StoreName(SKU):
    check = 'https://XXX?keyword=' + SKU
    r = requests.get(check)
    html = requests.get(r.url)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    # ... parse storeName out of bsObj ...
    return storeName

def get_h1Tag(u):
    html = requests.get(u)
    bsObj = BeautifulSoup(html.content, 'xml')
    h1 = bsObj.find('h1', attrs={'itemprop': 'name'}).get_text()
    return h1
How can I reduce the number of requests or connections to the URLs, and reuse one request or one connection throughout the whole program?
I assume this is a script with a group of methods you call in a particular order.
If so, this is a good use case for a dict. I would write a function that memoizes calls to URLs.
You can then reuse this function across your other functions:
import requests
from bs4 import BeautifulSoup

requests_cache = {}

def get_url(url, format_parser):
    if url not in requests_cache:
        r = requests.get(url)
        html = requests.get(r.url)
        requests_cache[url] = BeautifulSoup(html.content, format_parser)
    return requests_cache[url]
def get_Price(makat):
    url = 'https://www.zap.co.il/search.aspx?keyword=' + makat
    bsObj = get_url(url, 'html.parser')
    # your code to find the price
    return zapPrice

def get_zapStoreName(makat):
    url = 'https://www.zap.co.il/search.aspx?keyword=' + makat
    bsObj = get_url(url, 'html.parser')
    # your code to find the store name
    return storeName

def get_h1Tag(u):
    bsObj = get_url(u, 'xml')
    h1 = bsObj.find('h1', attrs={'itemprop': 'name'}).get_text()
    return h1
If you want to avoid a global variable, you can also set requests_cache as an attribute of get_url or as a default argument in the definition. The latter would also allow you to bypass the cache by passing an empty dict.
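For illustration, here is a minimal sketch of the default-argument variant (untested; a single requests.get is enough here, since requests follows redirects by default, and the keyword value is just an example):
def get_url(url, format_parser, requests_cache={}):
    # the mutable default dict persists across calls and acts as the cache
    if url not in requests_cache:
        r = requests.get(url)
        requests_cache[url] = BeautifulSoup(r.content, format_parser)
    return requests_cache[url]

# reuses the shared cache:
bsObj = get_url('https://www.zap.co.il/search.aspx?keyword=123', 'html.parser')
# bypasses the shared cache by supplying a throwaway dict:
bsObj = get_url('https://www.zap.co.il/search.aspx?keyword=123', 'html.parser', {})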
Again, the assumption here is that you are running this code as a script periodically. In that case, the requests_cache will get cleared every time you run the program.
However, if this is part of a larger program, you would want to 'expire' the cache on a regular basis, otherwise you would get the same results every time.
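With the plain dict approach, one simple way to do that is to store a timestamp alongside each cached entry and re-fetch once it is older than some TTL. A rough, untested sketch (the 300-second TTL is just an example value):
import time

CACHE_TTL = 300  # seconds; illustrative value
requests_cache = {}

def get_url(url, format_parser):
    entry = requests_cache.get(url)
    if entry is None or time.time() - entry[0] > CACHE_TTL:
        soup = BeautifulSoup(requests.get(url).content, format_parser)
        requests_cache[url] = (time.time(), soup)
    return requests_cache[url][1]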
This is a good use case for the requests-cache library. Example:
from bs4 import BeautifulSoup
from requests_cache import CachedSession

# Save cached responses in a SQLite file (scraper_cache.sqlite) and expire them after 6 minutes (360 seconds)
session = CachedSession('scraper_cache.sqlite', expire_after=360)
def get_Price(SKU):
    check = 'https://www.XXX=' + SKU
    r = session.get(check)
    html = session.get(r.url)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    # ... parse Price out of bsObj ...
    return Price

def get_StoreName(SKU):
    check = 'https://XXX?keyword=' + SKU
    r = session.get(check)
    html = session.get(r.url)
    bsObj = BeautifulSoup(html.content, 'html.parser')
    # ... parse storeName out of bsObj ...
    return storeName

def get_h1Tag(u):
    html = session.get(u)
    bsObj = BeautifulSoup(html.content, 'xml')
    h1 = bsObj.find('h1', attrs={'itemprop': 'name'}).get_text()
    return h1
Aside: with or without requests-cache, using sessions is good practice whenever you're making repeated calls to the same host, since it uses connection pooling: https://docs.python-requests.org/en/latest/user/advanced/#session-objects
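For completeness, a plain requests.Session (without caching) already gives you connection pooling; a minimal sketch with placeholder URLs and SKUs:
import requests

session = requests.Session()  # one session -> one connection pool, reused per host
for sku in ['12345', '67890']:
    r = session.get('https://www.example.com/search', params={'keyword': sku})
    # ... parse r.content with BeautifulSoup as before ...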
I use Scrapy 1.5.1
My goal is to go through the entire chain of requests for each variable before moving on to the next variable. For some reason Scrapy takes 2 variables, sends 2 requests, then takes another 2 variables, and so on.
I have already set CONCURRENT_REQUESTS = 1.
Here is my code sample:
def parsed(self, response):
    # inspect_response(response, self)
    search = response.meta['search']
    for idx, i in enumerate(response.xpath("//table[@id='ctl00_ContentPlaceHolder1_GridView1']/tr")[1:]):
        __EVENTARGUMENT = 'Select${}'.format(idx)
        data = {
            '__EVENTARGUMENT': __EVENTARGUMENT,
        }
        yield scrapy.Request(response.url, method='POST', headers=self.headers, body=urlencode(data),
                             callback=self.res_before_get, meta={'search': search}, dont_filter=True)

def res_before_get(self, response):
    # inspect_response(response, self)
    url = 'http://www.moj-yemen.net/Search_detels.aspx'
    yield scrapy.Request(url, callback=self.results, dont_filter=True)
My desired behavior is:
One value from parsed is sent to res_before_get, and then I do something with it.
Then another value from parsed is sent to res_before_get, and so on:
POST
GET
POST
GET
But currently Scrapy takes 2 values from parsed and adds them to the queue, then sends 2 requests from res_before_get. Thus I'm getting duplicate results:
POST
POST
GET
GET
What am I missing?
P.S.
This is an ASP.NET site. Its logic is as follows:
Make a POST request with the search payload.
Make a GET request to fetch the actual data.
Both requests share the same session ID.
That's why it is important to preserve the order.
At the moment I'm getting POST1 and then POST2, and since the session ID is associated with POST2, both GET1 and GET2 return the same page.
Scrapy works asynchronously, so you cannot expect it to respect the order of your loops.
If you need it to work sequentially, you have to chain the callbacks so that each one triggers the next, for example:
def parse1(self, response):
    ...
    yield Request(..., callback=self.parse2, meta={...(necessary information)...})

def parse2(self, response):
    ...
    if (necessary information):
        yield Request(...,
                      callback=self.parse2,
                      meta={...(remaining necessary information)...},
                      )
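Applied to this question, one way to enforce that ordering is to carry the list of remaining row indices through meta and only issue the next POST from the final callback, so each POST/GET pair finishes before the next one starts. A rough, untested sketch; the helper make_post and the final callback results are illustrative names:
from urllib.parse import urlencode
import scrapy

def parsed(self, response):
    search = response.meta['search']
    rows = response.xpath("//table[@id='ctl00_ContentPlaceHolder1_GridView1']/tr")[1:]
    pending = list(range(len(rows)))  # row indices still to process
    if pending:
        yield self.make_post(response.url, search, pending)

def make_post(self, url, search, pending):
    idx = pending.pop(0)
    data = {'__EVENTARGUMENT': 'Select${}'.format(idx)}
    return scrapy.Request(url, method='POST', headers=self.headers, body=urlencode(data),
                          callback=self.res_before_get,
                          meta={'search': search, 'pending': pending, 'post_url': url},
                          dont_filter=True)

def res_before_get(self, response):
    url = 'http://www.moj-yemen.net/Search_detels.aspx'
    yield scrapy.Request(url, callback=self.results, meta=response.meta, dont_filter=True)

def results(self, response):
    # ... extract the data for this row here ...
    meta = response.meta
    if meta['pending']:  # only now schedule the next POST, keeping strict POST/GET order
        yield self.make_post(meta['post_url'], meta['search'], meta['pending'])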
I've created a crawler which parses certain content from a website.
First, it scrapes the category links from the left sidebar.
Second, it harvests all the links spread across the pagination that lead to the profile pages.
And finally, going to each profile page, it scrapes the name, phone and web address.
So far, it is doing well. The only problem I see with this crawler is that it always starts scraping from the second page, skipping the first. I suppose there might be some way to get around this. Here is the complete code I am trying:
import requests
from lxml import html

url = "https://www.houzz.com/professionals/"

def category_links(mainurl):
    req = requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)  # links to the category from the left sidebar

def next_pagelink(process_links):
    req = requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)  # the links spread through pagination connected to the profile pages

def profile_pagelink(procured_links):
    req = requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links)  # profile page of each link

def target_pagelink(main_links):
    req = requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles, xpath):
        info = titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
        print(name, phone, web)

category_links(url)
The problem with the first page is that it doesn't have a 'pagination' class, so the expression tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list and the profile_pagelink function never gets executed.
As a quick fix, you can handle this case separately in the category_links function:
def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/":
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)
Also, I noticed that target_pagelink prints a lot of empty strings as a result of if_exist returning "". You can skip those cases if you add a condition in the for loop:
for titles in tree.xpath("//div[@class='container']"):  # use class='profile-cover' if you get duplicates
    name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
    if name + phone + web:
        print(name, phone, web)
Finally, requests.Session is mostly used for persisting cookies and other headers across requests, which is not necessary for your script. You can just use requests.get and get the same results.
I'm trying to extract emails from web pages. Here is my email grabber function:
import re
import requests
import bs4 as bs

def emlgrb(x):
    email_set = set()
    for url in x:
        try:
            response = requests.get(url)
            soup = bs.BeautifulSoup(response.text, "lxml")
            emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", soup.text, re.I))
            email_set.update(emails)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            continue
    return email_set
This function should be fed by another function that creates a list of URLs. Here is the feeder function:
def handle_local_links(url, link):
    if link.startswith("/"):
        return "".join([url, link])
    return link

def get_links(url):
    try:
        response = requests.get(url, timeout=5)
        soup = bs.BeautifulSoup(response.text, "lxml")
        body = soup.body
        links = [link.get("href") for link in body.find_all("a")]
        links = [handle_local_links(url, link) for link in links]
        links = [str(link.encode("ascii")) for link in links]
        return links
    # ... (many except clauses follow; each just returns an empty list)
It continues with many except clauses which, if triggered, return an empty list (not important). However, the return value from get_links() looks like this:
["b'https://pythonprogramming.net/parsememcparseface//'"]
Of course there are many links in the list (I cannot post it, not enough reputation). The emlgrb() function is not able to process the list (InvalidSchema: No connection adapters were found). However, if I manually remove the b and the redundant quotes, so that the list looks like this:
['https://pythonprogramming.net/parsememcparseface//']
emlgrb() works. Any suggestion as to where the problem is, or how to create a "cleaning function" to turn the first list into the second, is welcome.
Thanks
The solution is to drop .encode('ascii')
def get_links(url):
    try:
        response = requests.get(url, timeout=5)
        soup = bs.BeautifulSoup(response.text, "lxml")
        body = soup.body
        links = [link.get("href") for link in body.find_all("a")]
        links = [handle_local_links(url, link) for link in links]
        links = [str(link) for link in links]
        return links
You can pass an encoding to str(), as documented: str(object=b'', encoding='utf-8', errors='strict').
That's because str() without an encoding calls .__str__() (which for bytes is the same as .__repr__()) on the object, so if it is a bytes object the output is "b'string'". That is also what gets printed when you do print(bytes_obj). And calling .encode() on a str object creates a bytes object!
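A quick illustration of the difference, using the URL from the question:
link = "https://pythonprogramming.net/parsememcparseface//"
b = link.encode("ascii")  # bytes object
str(b)             # "b'https://pythonprogramming.net/parsememcparseface//'"  (repr-style string)
str(b, "ascii")    # 'https://pythonprogramming.net/parsememcparseface//'     (decoded back to str)
b.decode("ascii")  # same as the line above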
I'm using bs4 and have iterated through all of the links I need on a single page. I then stored those links in a list.
Here's my code:
def scrape1(self):
    html = self.browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    # add links to a list for later use
    urls = []
    for videos in soup.find_all('a', {'class': 'watch-now'}):
        links = videos['href']
        urls.append(links)
    return urls
def use(self):
    urls = scrape1()
I thought that by using return I could use the urls in a different method? I want to be able to use every link I appended to the urls list. Is there a better way to do this with classes that I'm not understanding?
Since these are instance methods, you should be using self to call them:
def use(self):
    urls = self.scrape1()
Also, you don't have to return from the scrape1() method; you can set an instance attribute instead, e.g.:
class MyScraper():
    # ...

    def scrape1(self):
        html = self.browser.page_source
        soup = BeautifulSoup(html, 'html.parser')
        self.urls = [a['href'] for a in soup.select('a.watch-now')]

    def use(self):
        self.scrape1()
        # use self.urls
        print(self.urls)
And, you will be able to use the urls this way as well:
scraper = MyScraper()
scraper.scrape1()
print(scraper.urls)
You could just have the method store the urls in an attribute of the class:
self.urls = urls
Then you can reference that attribute from other methods.
Anything assigned to self. is an attribute that you can reference across the class, so you can write another method that uses self.urls without needing to receive it as a parameter.
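A minimal, self-contained sketch of that idea (class and method names are illustrative):
class Scraper:
    def scrape1(self):
        # pretend these hrefs came from BeautifulSoup
        self.urls = ["https://example.com/a", "https://example.com/b"]

    def use(self):
        # no parameter passing needed; just read the instance attribute
        for url in self.urls:
            print(url)

scraper = Scraper()
scraper.scrape1()
scraper.use()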
So, the purpose of this code is to produce the list of URLs on the page, but I found out that the number of output URLs depends on the position of the elements in the array used while iterating, i.e. params = ["src", "href"].
The code below is a working program: it imports the requests library, uses requests.get() and response.text, and relies on such structures as lists and loops.
Questions:
1. Why do I get 8 URLs when "src" is in position 0 of the params array, and 136 URLs when "href" is in position 0 of the params array?
2. How is it possible to obtain all elements (src and href) in the array all_urls?
import requests

domain = "https://www.python.org/"
response = requests.get(domain)
page = response.text
all_urls = set()
params = ["src", "href"]

def getURL(page, param):
    start_link = page.find(param)
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1: end_quote]
    return url, end_quote

for param in params:
    while True:
        url, n = getURL(page, param)
        page = page[n:]
        # count += 1
        if url:
            if url.startswith('/') or url.startswith('#!'):
                all_urls.add(domain + url)
            elif url.startswith('http'):
                all_urls.add(url)
            else:
                continue
        else:
            break

print("all urls length:", len(all_urls))
To answer your questions:
1. This happens because you are consuming your page variable inside the loop:
url, n = getURL(page, param)
page = page[n:]  # this one here
This slices the page string after each iteration and reassigns it to the same variable, hence you lose a chunk on each iteration. When you get to the last src or href you are probably already at the end of the document.
2. A very quick fix for your code would be to reset the page for each new param:
for param in params:
    page = response.text
    while True:
        url, n = getURL(page, param)
        page = page[n:]
        # ... rest of the loop unchanged ...
However
There is a far better way to handle HTML: why don't you just use an HTML parser for this task?
For example, you could use BeautifulSoup 4 (not optimal code and not tested, just a quick demonstration):
import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.python.org/")
page = BeautifulSoup(response.text, "html.parser")

all_urls = list()
elements = page.find_all(lambda tag: tag.has_attr('src') or tag.has_attr('href'))

for elem in elements:
    if elem.has_attr('src'):
        all_urls.append(elem['src'])
    elif elem.has_attr('href'):
        all_urls.append(elem['href'])

print("all urls with dups length:", len(all_urls))
print("all urls w/o dups length:", len(set(all_urls)))