Apologies if this code has become convoluted. I don't have much experience with threading, and I've been throwing everything at the wall to see what sticks.
My goal is to run two parallel instances of a recursive, Selenium-based web crawler script. Each crawler runs a separate instance of ChromeDriver. The browsers each launch separately, but as they start crawling, each browser instance begins to crawl the other's links, switching back and forth throughout the duration.
I've tried adding locks at various points, but these don't seem to help. Ultimately I want each browser to run only one instance of a crawler and close the browser once complete. Is there a way that I can run these separately without cross-interference?
This problem doesn't occur when using multiprocessing, but that approach brings its own complications: the crawler class uses unpicklable objects, such as an SQLite connection used for logging (or at least that appears to be the problem).
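For what it's worth, the usual way around the pickling complaint is to create the unpicklable resources (the SQLite connection, the WebDriver) inside the worker process rather than passing them in, so only plain, picklable config data crosses the process boundary. A minimal sketch of that pattern, assuming a hypothetical run_crawler worker and made-up config paths (not the actual project layout):

import multiprocessing as mp
import sqlite3

def run_crawler(config_path):
    # The SQLite connection (unpicklable) is created inside the child
    # process; only the picklable config path crosses the process boundary.
    log_db = sqlite3.connect("crawler_log.db")
    try:
        pass  # build the Crawler here and hand it log_db for logging
    finally:
        log_db.close()

if __name__ == "__main__":
    configs = ["journals/bcmj/config.json", "journals/smw/config.json"]
    with mp.Pool(processes=2) as pool:
        pool.map(run_crawler, configs)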
For additional context, I'm running these crawlers from a tkinter GUI which displays the current URL that each crawler is on as a status/progress indicator.
Any help or insight here would be much appreciated.
Crawler class:
class Crawler:

    def __init__(self, config, browser):
        self.config_path = config['path']
        self.name = config['name']
        self.startURIs = config['startURIs']
        self.URIs = config['URIs']
        self.maxDepth = config['maxDepth']
        self.new_content = 0
        self.regex_query = self.CreateURIPattern()
        self.options = webdriver.ChromeOptions()
        self.download_dir = config['download_dir']
        self.browser = browser

    def Crawl(self, URL, maxDepth, q):
        queue = Queue()
        lock = threading.Lock()
        print(self.config_path, URL, threading.currentThread().getName())
        lock.acquire()
        queue.put(URL)
        browser = self.browser
        q.put([self.config_path, URL])
        self.crawled_links.append(URL)
        self.browser.get(queue.get())
        raw_links = browser.find_elements_by_tag_name('a')
        for link in raw_links:
            href = link.get_attribute('href')
            if href is not None and href not in list(self.links.keys()):
                if href.endswith('#') and link.get_attribute('onClick') is not None:
                    link = re.search(r'(?:https*:\/\/[\w_-]+(?:(?:\.[\w_-]+)+)[\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])', link.get_attribute('onClick'), re.X).group()
                else:
                    link = href
                if re.search(self.regex_query, link) is not None and re.search(r'.*#', link) is None:
                    self.links[link] = {'Referring URL': URL, 'Depth level': maxDepth - 1}
        for key, value in list(self.links.items()):
            if key not in self.crawled_links and value['Depth level'] > 0:
                try:
                    self.Crawl(key, value['Depth level'], q)
                except Exception as e:
                    print(e)
                    pass
            else:
                pass
        lock.release()
Runner class (called from the thread pool):
class Runner():

    def __init__(self, config, browser):
        self.config = config
        self.browser = browser

    def run_crawler(self, browser, q):
        with open(self.config, 'rb') as f:
            data = json.load(f)
            data['path'] = '/'.join(self.config.split('/')[-3:-1])
        time.sleep(3)
        try:
            c = Crawler(data, browser)
            for URI in c.startURIs:
                c.Crawl(URI, c.maxDepth, q)
            done_message = (f'\nCRAWLING COMPLETE: {c.name}.\n{c.new_content} files added.\nCrawler took {self.timer(t1, time.time())}.\n')
            print(done_message)
            c.browser.quit()
        except Exception as e:
            print(e)
            try:
                c.browser.quit()
            except:
                pass
EDIT:
Sample output displaying the class instance, Chrome session ID, thread name, and URL. In the fourth row, you can see that Thread-192084 has begun picking up links intended for Thread-192083 (Swiss Medical Weekly).
British Columbia Medical Journal <selenium.webdriver.chrome.webdriver.WebDriver (session="237e27ff8e35528a0e1c24002d8b4bcb")> Thread-192084 https://bcmj.org/past-issues
British Columbia Medical Journal <selenium.webdriver.chrome.webdriver.WebDriver (session="237e27ff8e35528a0e1c24002d8b4bcb")> Thread-192084 https://bcmj.org/cover/januaryfebruary-2022
Swiss Medical Weekly <selenium.webdriver.chrome.webdriver.WebDriver (session="9f0b6f8e2ba9401da74629ef36284316")> Thread-192083 https://smw.ch/archive
British Columbia Medical Journal <selenium.webdriver.chrome.webdriver.WebDriver (session="237e27ff8e35528a0e1c24002d8b4bcb")> Thread-192084 https://smw.ch/issue-1/edn/smw.2022.0910
Swiss Medical Weekly <selenium.webdriver.chrome.webdriver.WebDriver (session="9f0b6f8e2ba9401da74629ef36284316")> Thread-192083 https://smw.ch/issue-1/edn/smw.2022.0708
Related
Before my question, it might be helpful to show you the general structure of my code:
class ItsyBitsy(object):

    def __init__(self):
        self.targets_a = dict()  # data like url:document_summary
                                 # filled by another function

    def visit_targets_a(self):
        browser = webdriver.Safari()
        for url in self.targets_a.keys():
            try:
                browser.switch_to.new_window('tab')
                browser.get(url)
                time.sleep(2)
            except Exception as e:
                print(f'{url} FAILED: {e}')
                continue
            # do some automation stuff
            time.sleep(2)
        print('All done!')
I can then instantiate the class and call my method without any issues:
spider = ItsyBitsy()
spider.visit_targets_a()
>>> All done!
However, after every tab is opened and the automation is complete, the window closes without any prompt, even though I do not call browser.close() or browser.quit() anywhere in my code.
My band-aid fix is calling time.sleep(9999999999999999) on the last iteration, which keeps the window open indefinitely thanks to the resulting OverflowError, but it is obviously not a solution.
So, how do I stop the browser from exiting?!
Bonus points if you can educate me on why this is happening.
Thanks guys/gals!
You need to override __exit__ and prevent browser.quit() from happening automatically.
This keeps the browser open if you set teardown=False:
class ItsyBitsy(object):

    def __init__(self, teardown=False):
        self.teardown = teardown
        self.targets_a = dict()  # data like url:document_summary
                                 # filled by another function
        self.browser = webdriver.Safari()

    def visit_targets_a(self):
        for url in self.targets_a.keys():
            try:
                self.browser.switch_to.new_window('tab')
                self.browser.get(url)
                time.sleep(2)
            except Exception as e:
                print(f'{url} FAILED: {e}')
                continue
            # do some automation stuff
            time.sleep(2)
        print('All done!')

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.teardown:
            self.browser.quit()


spider = ItsyBitsy(teardown=False)
spider.visit_targets_a()
Are you using VS Code? Half a year ago I had the same problem, and switching to Sublime Text fixed it. The problem appears because VS Code runs Python code in a slightly unusual way (via an extension): it kills all processes created by the script once the last line of code has executed.
I'm using Selenium to scrape Linkedin for jobs but I'm getting the stale reference error.
I've tried refreshing, explicit waits, WebDriverWait, and a try/except block.
It always fails on page 2.
I'm aware it could be a DOM issue and have run through a few of the answers to that but none of them seem to work for me.
def scroll_to(self, job_list_item):
    """Just a function that will scroll to the list item in the column
    """
    self.driver.execute_script("arguments[0].scrollIntoView();", job_list_item)
    job_list_item.click()
    time.sleep(self.delay)

def get_position_data(self, job):
    """Gets the position data for a posting.

    Parameters
    ----------
    job : Selenium webelement

    Returns
    -------
    list of strings : [position, company, location, details]
    """
    # This is where the error is!
    [position, company, location] = job.text.split('\n')[:3]
    details = self.driver.find_element_by_id("job-details").text
    return [position, company, location, details]

def wait_for_element_ready(self, by, text):
    try:
        WebDriverWait(self.driver, self.delay).until(EC.presence_of_element_located((by, text)))
    except TimeoutException:
        logging.debug("wait_for_element_ready TimeoutException")
        pass
logging.info("Begin linkedin keyword search")
self.search_linkedin(keywords, location)
self.wait()
# scrape pages,only do first 8 pages since after that the data isn't
# well suited for me anyways:
for page in range(2, 3):
jobs = self.driver.find_elements_by_class_name("occludable-update")
#jobs = self.driver.find_elements_by_css_selector(".occludable-update.ember-view")
#WebDriverWait(self.driver, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'occludable-update')))
for job in jobs:
self.scroll_to(job)
#job.click
[position, company, location, details] = self.get_position_data(job)
# do something with the data...
data = (position, company, location, details)
#logging.info(f"Added to DB: {position}, {company}, {location}")
writer.writerow(data)
# go to next page:
bot.driver.find_element_by_xpath(f"//button[#aria-label='Page {page}']").click()
bot.wait()
logging.info("Done scraping.")
logging.info("Closing DB connection.")
f.close()
bot.close_session()
I expect that when job_list_item.click() is performed, a new page is loaded. Since you are looping over jobs, which is a list of WebElements, those elements become stale once the DOM changes; you return to the page, but your jobs list already refers to stale elements.
To avoid a stale element, I generally avoid reusing an element across loop iterations or holding it in a variable, especially if the element may change.
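A minimal sketch of that idea applied to the loop above, re-locating the job cards by index on every iteration instead of holding on to the original WebElement list (the class name and helper methods are taken from the question's code and may need adjusting):

# Count the cards once, then do a fresh lookup on every iteration so a
# stale reference is never reused after the DOM changes.
num_jobs = len(self.driver.find_elements_by_class_name("occludable-update"))

for index in range(num_jobs):
    job = self.driver.find_elements_by_class_name("occludable-update")[index]
    self.scroll_to(job)
    [position, company, location, details] = self.get_position_data(job)
    writer.writerow((position, company, location, details))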
I am trying to adapt a web scraping script for Semantic Scholar using Selenium. This part of the code works as expected, opening the website, typing the title into the search bar, and pressing Enter:
def _search_paper_by_name(self, paper_title: str) -> None:
    """
    Go to the search page for 'paper_name'.
    """
    self._web_driver.get(self._site_url)
    self._wait_element_by_name('q')
    input_search_box = self._web_driver.find_element_by_name('q')
    input_search_box.send_keys(paper_title)
    input_search_box.send_keys(Keys.ENTER)
However, if at any point I try printing the page's source, the code stops working as expected (nothing is typed into the search bar), and no HTML is printed:
def _search_paper_by_name(self, paper_title: str) -> None:
    """
    Go to the search page for 'paper_name'.
    """
    self._web_driver.get(self._site_url)
    self._wait_element_by_name('q')
    print(self._web_driver.page_source)
    input_search_box = self._web_driver.find_element_by_name('q')
    input_search_box.send_keys(paper_title)
    input_search_box.send_keys(Keys.ENTER)
I'm confused: I don't understand why I'm not able to print the page source, or why the print statement messes up the rest of the code.
EDIT: As requested
I have a SemanticScraper object. The web driver gets initialized as None when the object is created:
def __init__(self, timeout=5, time_between_api_call=0.5, headless=True,
             site_url='https://www.semanticscholar.org/'):
    """
    :param timeout: Number of seconds the web driver should wait before raising a timeout error.
    :param time_between_api_call: Time in seconds between api calls.
    :param headless: If set to be true, the web driver launches a browser (chrome) in silent mode.
    :param site_url: Home for semantic scholar search engine.
    """
    self._site_url = site_url
    self._web_driver = None
    self._timeout = timeout
    self._time_between_api_call = time_between_api_call
    self._headless = headless
And then whenever the scraper makes a search, it opens the browser like this:
def _start_browser(self):
    chrome_options = Options()
    if self._headless:
        chrome_options.add_argument("--headless")
    self._web_driver = webdriver.Chrome(chrome_options=chrome_options, executable_path='C:\\Users\\username\\Downloads\\chromedriver_win32\\chromedriver.exe')
I am trying to convert my Selenium web scraper to Scrapy, because Selenium is not mainly intended for web scraping.
I just started writing it and have already hit a roadblock. My code is below.
import scrapy
from scrapy.crawler import CrawlerProcess
from pathlib import Path

max_price = "110000"
min_price = "65000"
region_code = "5E430"


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = "https://www.rightmove.co.uk/property-for-sale/find.html?locationIdentifier=REGION%" + region_code + "&minBedrooms=2&maxPrice=" + max_price + "&minPrice=" + min_price + "&propertyTypes=detached" + \
              "%2Csemi-detached%2Cterraced&primaryDisplayPropertyType=houses&includeSSTC=false&mustHave=&dontShow=sharedOwnership%2Cretirement&furnishTypes=&keywords="
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        work_path = "C:/Users/Cristi/Desktop/Scrapy_ROI_work_area/"
        no_of_pages = response.xpath('//span[@class = "pagination-pageInfo"]').getall()
        with open(Path(work_path, "test.txt"), 'wb') as f:
            f.write(response.body)
        with open(Path(work_path, "extract.txt"), 'wb') as g:
            g.write(no_of_pages)
        self.log('Saved file test.txt')


process = CrawlerProcess()
process.crawl(QuotesSpider)
process.start()
My roadblock is that response.body does not contain the element matched by the XPath expression //span[@class = "pagination-pageInfo"], even though the website does have it. I am way out of my depth with the inner workings of websites and am not a programmer by profession... unfortunately. Would anyone help me understand what is happening, please?
You first have to understand that there is a big difference between what you see in your browser and what the server actually sends you.
Apart from the HTML, the server is most of the time also sending you JavaScript code that modifies the HTML itself at runtime.
For example, the first GET you make to a page can return an empty table and some JavaScript code. That code is then in charge of hitting a database and filling the table. If you try to scrape that site with Scrapy alone, it will fail to get the table because Scrapy does not have a JavaScript engine able to run the code.
This is your case here, and will be your case for most of the pages you will try to crawl.
You need something to render the code in the page. The best option for Scrapy is Splash:
https://github.com/scrapinghub/splash
It's a headless, scriptable browser that you can use via a Scrapy plugin. It's maintained by Scrapinghub (the creators of Scrapy), so the two work well together.
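As a rough illustration of what that looks like (a minimal sketch based on the scrapy-splash plugin's documented setup; it assumes a Splash instance running locally on port 8050, the spider name is made up, and the XPath is the one from the question):

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy_splash import SplashRequest


class RenderedQuotesSpider(scrapy.Spider):
    name = "rendered_quotes"

    def start_requests(self):
        url = "https://www.rightmove.co.uk/property-for-sale/find.html"
        # Ask Splash to render the page (and run its JavaScript) before
        # handing the HTML back to the spider.
        yield SplashRequest(url, callback=self.parse, args={"wait": 2})

    def parse(self, response):
        # The rendered DOM now contains the JavaScript-generated elements.
        pages = response.xpath('//span[@class = "pagination-pageInfo"]/text()').getall()
        self.log(f"pagination info: {pages}")


process = CrawlerProcess(settings={
    "SPLASH_URL": "http://localhost:8050",
    "DOWNLOADER_MIDDLEWARES": {
        "scrapy_splash.SplashCookiesMiddleware": 723,
        "scrapy_splash.SplashMiddleware": 725,
        "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
    },
    "SPIDER_MIDDLEWARES": {"scrapy_splash.SplashDeduplicateArgsMiddleware": 100},
    "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
})
process.crawl(RenderedQuotesSpider)
process.start()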
I am executing a Python script with threading where, given a "query" term that I put in the queue, I create the URL with the query parameters, set the cookies, and parse the webpage to return the products and the URLs of those products. Here's the script.
Task: for a given set of queries, store the top 20 product IDs in a file, or fewer if the query returns fewer results.
I remember reading that Selenium is not thread-safe. I just want to make sure that this problem occurs because of that limitation, and whether there is a way to make it work in concurrent threads. The main issue is that the script is I/O bound, so it is very slow for scraping about 3000 URL fetches.
from pyvirtualdisplay import Display
from data_mining.scraping import scraping_conf as sf #custom file with rules for scraping
import Queue
import threading
import urllib2
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
num_threads=5
COOKIES=sf.__MERCHANT_PARAMS[merchant_domain]['COOKIES']
query_args =sf.__MERCHANT_PARAMS[merchant_domain]['QUERY_ARGS']
class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""

    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def url_from_query(self, query):
        for key, val in query_args.items():
            if query_args[key] == 'query':
                query_args[key] = query
        print "query", query
        try:
            url = base_url + urllib.urlencode(query_args)
            print "url"
            return url
        except Exception as e:
            log()
            return None

    def init_driver_and_scrape(self, base_url, query, url):
        # Will use Pyvirtual display later
        # display = Display(visible=0, size=(1024, 768))
        # display.start()
        fp = webdriver.FirefoxProfile()
        fp.set_preference("browser.download.folderList", 2)
        fp.set_preference("javascript.enabled", True)
        driver = webdriver.Firefox(firefox_profile=fp)
        driver.delete_all_cookies()
        driver.get(base_url)
        for key, val in COOKIES[exp].items():
            driver.add_cookie({'name': key, 'value': val, 'path': '/', 'domain': merchant_domain, 'secure': False, 'expiry': None})
        print "printing cookie name & value"
        for cookie in driver.get_cookies():
            if cookie['name'] in COOKIES[exp].keys():
                print cookie['name'], "-->", cookie['value']
        driver.get(base_url + 'search=junk')  # To counter any refresh issues
        driver.implicitly_wait(20)
        driver.execute_script("window.scrollTo(0, 2000)")
        print "url inside scrape", url
        if url is not None:
            flag = True
            i = -1
            row_data, row_res = (), ()
            while flag:
                i = i + 1
                try:
                    driver.get(url)
                    key = sf.__MERCHANT_PARAMS[merchant_domain]['GET_ITEM_BY_ID'] + str(i)
                    print key
                    item = driver.find_element_by_id(key)
                    href = item.get_attribute("href")
                    prod_id = eval(sf.__MERCHANT_PARAMS[merchant_domain]['PRODUCTID_EVAL_FUNC'])
                    row_res = row_res + (prod_id,)
                    print url, row_res
                except Exception as e:
                    log()
                    flag = False
            driver.delete_all_cookies()
            driver.close()
            return query + "|" + str(row_res) + "\n"  # row_data, row_res
        else:
            return [query + "|" + "None"] + "\n"

    def run(self):
        while True:
            # grabs host from queue
            query = self.queue.get()
            url = self.url_from_query(query)
            print "query, url", query, url
            data = self.init_driver_and_scrape(base_url, query, url)
            self.out_queue.put(data)
            # signals to queue job is done
            self.queue.task_done()
class DatamineThread(threading.Thread):
    """Threaded Url Grab"""

    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # grabs host from queue
            data = self.out_queue.get()
            fh.write(str(data) + "\n")
            # signals to queue job is done
            self.out_queue.task_done()
start = time.time()


def log():
    logging_hndl = logging.getLogger("get_results_url")
    logging_hndl.exception("Stacktrace from " + "get_results_url")


df = pd.read_csv(fh_query, sep='|', skiprows=0, header=0, usecols=None, error_bad_lines=False)  # read all queries
query_list = list(df['query'].values)[0:3]
def main():
    exp = "Control"
    # spawn a pool of threads, and pass them queue instance
    for i in range(num_threads):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()
    # populate queue with data
    print query_list
    for query in query_list:
        queue.put(query)
    for i in range(num_threads):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()
    # wait on the queue until everything has been processed
    queue.join()
    out_queue.join()


main()
print "Elapsed Time: %s" % (time.time() - start)
While I should be getting all search results from each URL's page, I get only the first (i=0) search card, and this doesn't execute for all queries/URLs. What am I doing wrong?
What I expect -
url inside scrape http://<masked>/search=nike+costume
searchResultsItem0
url inside scrape http://<masked>/search=red+tops
searchResultsItem0
url inside scrape http://<masked>/search=halloween+costumes
searchResultsItem0
and more searchResultsItem(s) , like searchResultsItem1,searchResultsItem2 and so on..
What I get
url inside scrape http://<masked>/search=nike+costume
searchResultsItem0
url inside scrape http://<masked>/search=nike+costume
searchResultsItem0
url inside scrape http://<masked>/search=nike+costume
searchResultsItem0
The skeleton code was taken from
http://www.ibm.com/developerworks/aix/library/au-threadingpython/
Additionally, when I use pyvirtualdisplay, will that work with threading as well? I also used processes with the same Selenium code, and it gave the same error. Essentially it opens up three Firefox browsers with identical URLs, while it should be opening them for different items in the queue. I stored the rules in a file that is imported as sf, which has all the custom attributes of a base domain.
Since setting the cookies is an integral part of my script, I can't use dryscrape.
EDIT :
I tried to localize the error, and here's what I found -
In the custom rules file that I import as "sf" above, I had defined QUERY_ARGS as
__MERCHANT_PARAMS = {
    "some_domain.com": {
        COOKIES: {
            # <a dict of dicts, masked here>
        },
        # ... more such rules
        QUERY_ARGS: {'search': 'query'}
    }
}
So what is really happening is that calling
query_args = sf.__MERCHANT_PARAMS[merchant_domain]['QUERY_ARGS']
should return the dict {'search': 'query'}, but instead it raises:
AttributeError: 'module' object has no attribute '_ThreadUrl__MERCHANT_PARAMS'
This is where I don't understand where the '_ThreadUrl__' prefix is coming from. I also tried re-initializing query_args inside the url_from_query method, but that doesn't work either.
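For context, that prefix is produced by Python's name mangling: any identifier with two leading underscores used inside a class body is rewritten to _ClassName__identifier before the lookup happens, even when it refers to an attribute of another module. A tiny standalone illustration (Python 3 syntax; the names are made up, not the real rules module):

import types

# A stand-in for the imported rules module (names here are hypothetical).
sf_demo = types.SimpleNamespace()
setattr(sf_demo, '__MERCHANT_PARAMS', {'search': 'query'})

class Outside:
    def read(self):
        # Inside a class body, __MERCHANT_PARAMS is rewritten to
        # _Outside__MERCHANT_PARAMS before the attribute lookup, so this
        # raises AttributeError about '_Outside__MERCHANT_PARAMS'.
        return sf_demo.__MERCHANT_PARAMS

    def read_unmangled(self):
        # A string passed to getattr() is not mangled, so this works.
        return getattr(sf_demo, '__MERCHANT_PARAMS')

print(Outside().read_unmangled())   # {'search': 'query'}
try:
    print(Outside().read())
except AttributeError as e:
    print(e)                         # ...no attribute '_Outside__MERCHANT_PARAMS'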
Any pointers on what I am doing wrong?
I may be replying pretty late to this, but I tested it on Python 2.7, and both options, multithreading and multiprocessing, work with Selenium; each opens two separate browsers.
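A minimal sketch of that pattern, with one WebDriver created and torn down entirely inside each thread so nothing Selenium-related is shared between them (Python 3 syntax; the URLs are placeholders and the browser choice is arbitrary):

import threading
from selenium import webdriver

def crawl(start_url):
    # Each thread owns its own driver; no driver object crosses threads.
    driver = webdriver.Firefox()
    try:
        driver.get(start_url)
        print(threading.current_thread().name, driver.title)
    finally:
        driver.quit()

threads = [
    threading.Thread(target=crawl, args=("https://example.com",)),
    threading.Thread(target=crawl, args=("https://example.org",)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()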