I have a script set up like the one below:
try:
    from Xinhua import Xinhua
except:
    error_message("Xinhua")
try:
    from China_Daily import China_Daily
except:
    error_message("China Daily")
try:
    from Global_Times import Global_Times
except:
    error_message("Global Times")
try:
    from Peoples_Daily import Peoples_Daily
except:
    error_message("People's Daily")
The purpose is to run a Scrapy crawler for each site, process the results, and upload those results to a database. When I run each script individually, each portion works fine. When I run from the block of code I've outlined, however, only the first Scrapy crawler actually works properly. All of the subsequent ones attempt to access the sites they are supposed to but don't return any results. I don't even get proper error messages back, just some "DEBUG... 200 None" and "[scrapy.crawler] INFO: Overridden settings: {}" lines. I also don't think the issue is my IP or anything being blocked; as soon as the crawlers fail I immediately launch them individually and they work great.
My guess is that the first crawler is leaving some settings behind that are interfering with the subsequent ones, but I haven't been able to find anything. I can rearrange the order of their execution and it is always the first in line that works while the others fail.
Any thoughts?
I fixed the issue by combining each crawler into one script and running them sequentially with CrawlerRunner, chaining the crawls with @defer.inlineCallbacks so they share a single reactor.
spider_settings = [
    {"FEEDS": {
        xh_crawl_results: {
            'format': 'json',
            'overwrite': True
        }}},
    {"FEEDS": {
        cd_crawl_results: {
            'format': 'json',
            'overwrite': True
        }}},
    {"FEEDS": {
        gt_crawl_results: {
            'format': 'json',
            'overwrite': True
        }}},
    {"FEEDS": {
        pd_crawl_results: {
            'format': 'json',
            'overwrite': True
        }}}
]
process_xh = CrawlerRunner(spider_settings[0])
process_cd = CrawlerRunner(spider_settings[1])
process_gt = CrawlerRunner(spider_settings[2])
process_pd = CrawlerRunner(spider_settings[3])
@defer.inlineCallbacks
def crawl():
    yield process_xh.crawl(XinhuaSpider)
    yield process_cd.crawl(ChinaDailySpider)
    yield process_gt.crawl(GlobalTimesSpider)
    yield process_pd.crawl(PeoplesDailySpider)
    reactor.stop()
print("Scraping started.")
crawl()
reactor.run()
print("Scraping completed.")
I have the code below, which I know has worked before but for some reason seems to be broken now. The code is meant to open a search engine, search for a query, and return a list of results by their href attribute. The web browser opens and navigates to http://www.startpage.com successfully, then puts the term I entered at the bottom into the search box, but then just closes the browser. No error, no links. Nothing.
import selenium.webdriver as webdriver

def get_results(search_term):
    url = "https://www.startpage.com"
    browser = webdriver.Firefox()
    browser.get(url)
    search_box = browser.find_element_by_id("query")
    search_box.send_keys(search_term)
    search_box.submit()
    try:
        links = browser.find_elements_by_xpath("//ol[@class='web_regular_results']//h3//a")
    except:
        links = browser.find_elements_by_xpath("//h3//a")
    results = []
    for link in links:
        href = link.get_attribute("href")
        print(href)
        results.append(href)
    browser.close()
    return results

get_results("dog")
Does anyone know what is wrong with this? Basically it gets to search_box.submit() then skips everything until browser.close().
Unlike find_element_by_xpath (which returns a single WebElement and raises an exception when nothing matches), find_elements_by_xpath won't throw an exception if it finds no results; it returns an empty list. links is empty, so the for loop is never executed. You can change the try/except to an if condition and check whether it has values:
links = browser.find_elements_by_xpath("//ol[@class='web_regular_results']//h3//a")
if not links:
    links = browser.find_elements_by_xpath("//h3//a")
It is also not recommended to call the browser's close function inside the function you are testing. Instead, call it after get_results("dog") and keep it out of the test logic:
get_results("dog")
browser.close()
This way Selenium completes the execution of the function first and then closes the browser window.
The problem with your current layout is that the method returns the result set only after the browser has already started closing the window, which is the logical problem you are seeing in your script.
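Note that in the original code browser is local to get_results, so calling browser.close() outside the function only works if the driver is created by the caller. A minimal sketch of that restructuring, reusing the old-style Selenium API from the question (the exact split is my own suggestion, not from the original post):
import selenium.webdriver as webdriver

def get_results(browser, search_term):
    # search using a driver owned by the caller instead of creating/closing one here
    browser.get("https://www.startpage.com")
    search_box = browser.find_element_by_id("query")
    search_box.send_keys(search_term)
    search_box.submit()
    links = browser.find_elements_by_xpath("//ol[@class='web_regular_results']//h3//a")
    if not links:
        links = browser.find_elements_by_xpath("//h3//a")
    return [link.get_attribute("href") for link in links]

browser = webdriver.Firefox()
results = get_results(browser, "dog")
browser.close()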
Background: Apache server using mod_wsgi to serve a Flask app using Flask_Sqlalchemy connecting to MySQL. This is a full stack application so it is nearly impossible to create a minimal example but I have tried.
My problem is that when I make some change that should modify the database, subsequent requests don't always reflect that change. For example, if I create an object and then try to edit that same object, the edit will sometimes fail.
Most of the time if I create an object then go to the page listing all the objects, it will not show up on the list. Sometimes it will show up until I refresh, when it will disappear, and with another refresh it shows up again.
The same happens with edits. Example code:
bp = Blueprint('api_region', __name__, url_prefix='/app/region')

@bp.route('/rename/<int:region_id>/<string:name>', methods=['POST'])
def change_name(region_id, name):
    region = Region.query.get(region_id)
    try:
        region.name = name
    except AttributeError:
        abort(404)
    db.session.add(region)
    db.session.commit()
    return "Success"

@bp.route('/name/<int:region_id>/', methods=['GET'])
def get_name(region_id):
    region = Region.query.get(region_id)
    try:
        name = region.name
    except AttributeError:
        abort(404)
    return name
After the object is created, send a POST:
curl -X POST https://example.com/app/region/rename/5/Europe
Then several GETs
curl -X GET https://example.com/app/region/name/5/
Sometimes the GET will return the correct info, but every now and then it will return whatever it was before. Further example output: https://pastebin.com/s8mqRHSR. It happens at varying frequency, but roughly one in 25 requests fails, and it isn't always the "last" value either; when testing, it seems to get 'stuck' at a certain value no matter how many times I change it.
I am using the "dynamically bound" example of Flask_Sqlalchemy
db = SQLAlchemy()

def create_app():
    app = Flask(__name__)
    db.init_app(app)
    # ... snip ...
    return app
This creates a scoped_session accessible as db.session.
Apache config is long and complicated but includes the line
WSGIDaemonProcess pixel processes=5 threads=5 display-name='%{GROUP}'
I can post more information if required.
For reference, if anyone finds this thread with the same issue: I fixed my problem.
My Flask app factory function had the line app.app_context().push() left over from the early days, when it was based off a Flask tutorial. Unfortunately that line was snipped out of the example code above, otherwise someone might have spotted it. During a restructuring of the project this line was left out and the problem fixed itself. I am not sure why or how this line would cause the issue, and only for some but not all requests.
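For anyone skimming, a minimal sketch of roughly what the factory looked like, with the offending line marked; this is reconstructed from the description above rather than copied from the real project:
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

def create_app():
    app = Flask(__name__)
    db.init_app(app)
    # leftover from a tutorial; removing this line fixed the stale reads,
    # though exactly why it caused them was never pinned down
    app.app_context().push()
    return app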
I want to get the current url when I am running Selenium.
I looked at this stackoverflow page: How do I get current URL in Selenium Webdriver 2 Python?
and tried the things posted but it's not working. I am attaching my code below:
from selenium import webdriver

# launch firefox
driver = webdriver.Firefox()
url1 = 'https://poshmark.com/search?'
# search in a window
driver.get(url1)
xpath = '//input[@id="user-search-box"]'
searchBox = driver.find_element_by_xpath(xpath)
brand = "freepeople"
style = "top"
searchBox.send_keys(' '.join([brand, "sequin", style]))

from selenium.webdriver.common.keys import Keys
# equivalent of hitting the enter key
searchBox.send_keys(Keys.ENTER)
print(driver.current_url)
my code prints https://poshmark.com/search? but it should print: https://poshmark.com/search?query=freepeople+sequin+top&type=listings&department=Women because that is what selenium goes to.
The issue is that there is no lag between your searchBox.send_keys(Keys.ENTER) and print(driver.current_url).
There should be some time lag so that the statement can pick up the url change. If your code fires before the url has actually changed, it gives you the old url.
The workaround would be to add time.sleep(1) to wait for 1 second, but a hard-coded sleep is not a good option. You should do one of the following:
Keep polling the url and wait for the change to happen
Wait for an object that you know will appear when the new page loads
Instead of using Keys.ENTER, simulate the operation with a .click() on the search button if one is available
Usually when you use the click method in Selenium it takes care of the page change, so you don't see such issues. Here you press a key using Selenium, which doesn't do any kind of waiting for the page load. That is why you see the issue in the first place.
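As a concrete illustration of the polling/explicit-wait options above, here is a minimal sketch that reuses the names from the question's script; it assumes a Selenium version that ships expected_conditions.url_changes:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

searchBox.send_keys(Keys.ENTER)
# wait up to 10 seconds for the url to differ from the original search url
WebDriverWait(driver, 10).until(EC.url_changes(url1))
print(driver.current_url)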
I had the same issue and I came up with a solution that uses an explicit wait (see how explicit waits work in the documentation).
Here is my solution
from contextlib import contextmanager
from selenium.webdriver.support.ui import WebDriverWait

class UrlHasChanged:
    def __init__(self, old_url):
        self.old_url = old_url

    def __call__(self, driver):
        return driver.current_url != self.old_url

@contextmanager
def url_change(driver):
    current_url = driver.current_url
    yield
    WebDriverWait(driver, 10).until(UrlHasChanged(current_url))
Explanation:
First, I created my own wait condition (see here) that takes old_url as a parameter (the url from before the action was made) and checks whether the old url is the same as current_url after the action. It returns False when both urls are the same and True otherwise.
Then I created a context manager to wrap the action I want to make: I save the url before the action, and afterwards I use WebDriverWait with the wait condition defined above.
Thanks to that, I can now reuse this context manager with any action that changes the url and wait for the change like this:
with url_change(driver):
    login_panel.login_user(normal_user['username'], new_password)

assert driver.current_url == dashboard.url
It is safe because WebDriverWait(driver, 10).until(UrlHasChanged(current_url)) waits until the current url changes, and after 10 seconds it stops waiting by throwing an exception.
What do you think about this?
I fixed this problem by grabbing the button's href and then doing driver.get(hreflink). Click() was not working for me!
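A minimal sketch of that workaround, with a purely hypothetical locator standing in for the real button:
# the xpath below is just a placeholder for whatever locates the button/link
link = driver.find_element_by_xpath("//a[@class='search-button']")
hreflink = link.get_attribute("href")
driver.get(hreflink)  # navigate directly instead of clicking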
I have a list of URLs. I want to get their content asynchronously every 10 seconds.
urls = [
    'http://www.python.org',
    'http://stackoverflow.com',
    'http://www.twistedmatrix.com',
    'http://www.google.com',
    'http://launchpad.net',
    'http://github.com',
    'http://bitbucket.org',
]
waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(saveResults)
reactor.run()
How do I do this? This code only lets me fetch the urls' content once. Calling it again throws error.ReactorNotRestartable().
Thanks :)
This is definitely possible with Twisted.
First off, although this is somewhat unrelated to your question, don't use getPage. It's a very limited API, with poor defaults for security on HTTPS. Instead, use Treq.
Now, onto your main question.
The important thing to understand about reactor.run() is that it doesn't mean "run this code here". It means "run the whole program". When reactor.run() exits, it's time for your program to exit.
Lucky for you, Twisted has a nice built-in way to do things on a regular schedule: LoopingCall.
Here's a working example, using treq and LoopingCall:
urls = [
    'http://www.python.org',
    'http://stackoverflow.com',
    'http://www.twistedmatrix.com',
    'http://www.google.com',
    'http://launchpad.net',
    'http://github.com',
    'http://bitbucket.org',
]

from twisted.internet.task import LoopingCall
from twisted.internet.defer import gatherResults
from treq import get, content

def fetchWebPages():
    return (gatherResults([get(url).addCallback(content) for url in urls])
            .addCallback(saveResults))

def saveResults(responses):
    print("total: {} bytes"
          .format(sum(len(response) for response in responses)))

repeatedly = LoopingCall(fetchWebPages)
repeatedly.start(10.0)

from twisted.internet import reactor
reactor.run()
As a bonus, this handles the case where fetchWebPages takes longer than 10 seconds, and will react intelligently rather than letting too many outstanding requests pile up, or delaying longer and longer as the requests take longer.
I want to download a page with its JavaScript executed, using Python. Qt is one possible solution, and here is the code:
class Downloader(QApplication):
    __event = threading.Event()

    def __init__(self):
        QApplication.__init__(self, [])
        self.webView = QWebView()
        self.webView.loadFinished.connect(self.loadFinished)

    def load(self, url):
        self.__event.clear()
        self.webView.load(QUrl(url))
        while not self.__event.wait(.05):
            self.processEvents()
        return self.webView.page().mainFrame().documentElement() if self.__ok else None

    def loadFinished(self, ok):
        self.__ok = ok
        self.__event.set()

downloader = Downloader()
page = downloader.load(url)
The problem is that sometimes downloader.load() returns a page without the javascript executed. Downloader.loadStarted() and Downloader.loadFinished() are called only once.
What is the proper way to wait for a complete page download?
EDIT
If I add self.webView.page().networkAccessManager().finished.connect(request_ended) in __init__() and define
def request_ended(reply):
    print(reply.error(), reply.url().toString())
then it turns out that sometimes reply.error() == QNetworkReply.UnknownNetworkError. This behaviour shows up when an unreliable proxy is used that fails to download some of the resources (some of which are js files), hence some of the js not being executed. When no proxy is used (i.e. the connection is stable), every reply.error() == QNetworkReply.NoError.
So, the updated question is:
Is it possible to retry the failed reply.request() and apply the result to self.webView?
JavaScript requires a runtime to execute (Python alone won't do); a popular one these days is PhantomJS.
Unfortunately, PhantomJS has no Python support anymore, so you could resort to e.g. Ghost.py to do this job for you, which allows you to selectively execute the JS you want.
You should use Selenium
It provides different WebDrivers, for example PhantomJS, or other common browsers like Firefox.
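A minimal sketch of the Selenium route, assuming Firefox is installed; the url is a placeholder, and you may still need an explicit wait for specific elements if the page loads content asynchronously:
from selenium import webdriver

url = "http://example.com/page-with-js"  # placeholder for the page you want
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source  # source after the browser has run the page's javascript
driver.quit()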