requests disappear after queueing in scrapy - python-3.5

Scrapy seems to complete without processing all the requests. I know this because I am logging before and after queueing each request, and I can clearly see requests that were queued but never processed.
I am also logging in both the parse and error callback methods, and neither of them was called for those missing requests.
How can I debug what happened to those requests?

You need to add dont_filter=True when re-queueing the request. Even if the request does not look identical to an earlier one, Scrapy remembers which requests it has already made and filters out any request it considers a duplicate, on the assumption that it was re-queued by mistake.
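
To make the mechanism concrete, here is a toy model of a duplicate filter. It is not Scrapy's code: ToyScheduler and fingerprint are hypothetical names, and the fingerprint is simplified to a hash of method, URL and body. In a real spider the equivalent fix is scrapy.Request(url, callback=self.parse, dont_filter=True), and setting DUPEFILTER_DEBUG = True in the project settings should make Scrapy log every filtered request rather than only the first one, which helps when tracking down requests that vanish.

```python
import hashlib


def fingerprint(method, url, body=b""):
    """Toy request fingerprint: a hash of method, URL and body (simplified)."""
    h = hashlib.sha1()
    for part in (method.upper().encode(), url.encode(), body):
        h.update(part)
        h.update(b"\x00")  # separator so adjacent fields cannot run together
    return h.hexdigest()


class ToyScheduler:
    """Hypothetical scheduler that silently drops requests it has seen before."""

    def __init__(self):
        self.seen = set()    # fingerprints of every request accepted so far
        self.queue = []      # requests that will actually be processed
        self.dropped = []    # URLs rejected as duplicates

    def enqueue(self, method, url, body=b"", dont_filter=False):
        fp = fingerprint(method, url, body)
        if not dont_filter and fp in self.seen:
            # No callback and no errback fire for a filtered request.
            self.dropped.append(url)
            return False
        self.seen.add(fp)
        self.queue.append((method, url, body))
        return True


scheduler = ToyScheduler()
first = scheduler.enqueue("GET", "http://example.com/page")
retry_filtered = scheduler.enqueue("GET", "http://example.com/page")
retry_forced = scheduler.enqueue("GET", "http://example.com/page", dont_filter=True)
print(first, retry_filtered, retry_forced)  # True False True
```

The second call is the situation from the question: the request is queued from the spider's point of view, but neither the parse callback nor the error callback ever runs for it.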

Related

OPTIONS Preflight request executes POST's code - is that standard?

If I understand correctly, a preflight OPTIONS request is sent as a way of asking "what's allowed here?". Then, once the response comes, if allowed, the calling site sends the POST request (or GET but in my case it's a post). I have figured out that, at least with Azure Function Apps, the OPTIONS request is executing the code that I expected only the POST to execute. I believe this to be the case because once I added some null checking (since the OPTIONS request doesn't have a payload in the body) everything worked fine.
I'm wondering if this is standard.
Seems to me that if I had written the API without using Azure Function Apps, I'd have the OPTIONS request sent down a path that would set the appropriate headers and return a 200 response. And the POST request would be sent down a different path that would expect a payload in the body. If that's how it usually works then that means I've just found an idiosyncrasy of the Azure functionality. But if not it means that I have something to learn about the OPTIONS preflight request.
Thanks in advance for your advice.
Denise
As sideshowbarker mentioned, the OPTIONS request is sent automatically by the browser to check if the cross-origin request can be made.
In the case of Azure Functions, this is handled by Azure itself when running in the cloud.
If your function is being triggered, that would mean that you have "options" listed as a supported method for your HTTP trigger:
in the HttpTrigger attribute for C# functions;
in function.json for non-C# functions.
If you want to customize the CORS responses, or you are running functions in a container, you could always include "options" as a supported method and respond differently when the incoming HTTP method is OPTIONS.
Also, if you are using Azure API Management with Azure Functions, you could offload CORS handling to it instead or even use Functions Proxies as shown here.
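
As a rough illustration of "respond differently when the incoming HTTP method is OPTIONS", here is a framework-neutral sketch. It does not use the Azure Functions API: handle_request, the (status, headers, body) return shape, and the allowed origin https://example.com are all assumptions made for the example.

```python
import json

# Assumed CORS policy for the example; adjust the origin to the calling site.
CORS_HEADERS = {
    "Access-Control-Allow-Origin": "https://example.com",
    "Access-Control-Allow-Methods": "POST, OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type",
}


def handle_request(method, body=b""):
    """Return (status, headers, body) for one request to a single endpoint."""
    method = method.upper()

    if method == "OPTIONS":
        # Preflight: answer with the CORS headers only, never touch the payload.
        return 204, dict(CORS_HEADERS), b""

    if method == "POST":
        try:
            payload = json.loads(body)
        except ValueError:
            # Covers an empty body as well as malformed JSON.
            error = json.dumps({"error": "a JSON payload is required"}).encode()
            return 400, dict(CORS_HEADERS), error
        reply = json.dumps({"received": payload}).encode()
        return 200, dict(CORS_HEADERS), reply

    # Any other method is not supported by this endpoint.
    return 405, {"Allow": "POST, OPTIONS"}, b""
```

Branching on the method first means a POST without a payload can still be rejected as invalid, instead of being tolerated just so that the preflight passes.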
Thanks y'all! Sorry I was unclear. And sorry it took me a while to get back. Things have been a bit crazy on this end.
Yes, the function being called is mine. And now I understand the browser doesn't have much choice as to whether or not it makes the OPTIONS call.
And yes, I could make my Azure function handle an options call differently and thanks for that suggestion too. That's sort of what I ended up doing but basically I did it by handling an empty payload. I didn't follow that best practice originally because I thought any valid request would have a payload. Accordingly, any request that did not have a payload was invalid and should be turned away as a failure of some sort. This was before I knew that the OPTIONS call was actually executing that function.
My remaining question is if I had NOT been using Azure... if I had rolled my own solution and hosted it somewhere, I'd have a class or at least methods that handle calls to this particular API. (This is something I'm new to so bear with me if my terms aren't quite right and please do correct me). So if I'd done my own API, I'd have one method to handle a POST call and a different method to handle an OPTIONS call, wouldn't I? And the method that handles the OPTIONS call would return information about what's legally do-able with this API. And the method that handles a POST call would handle the payload sent with it. And the method that handles the POST wouldn't get executed when an OPTIONS request is sent. At least that's how I figured it would work. And that's my question -- is that how it's done when not letting something like Azure handle some of the infrastructure?
I'm just trying to learn if the OPTIONS request executing a POST's function is a standard practice or if it's some kind of idiosyncrasy to working with Azure functions.
Thanks again for the advice and for helping me understand these questions.
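
For comparison, in a self-hosted API the server or framework typically dispatches on the HTTP method, so the preflight and the POST do end up in separate methods. A minimal sketch with Python's standard http.server module follows; ApiHandler, make_server and the allowed origin are names and values assumed for the example, not taken from the question.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class ApiHandler(BaseHTTPRequestHandler):
    """One handler class; the base class routes each HTTP method to do_<METHOD>."""

    def _send_cors_headers(self):
        # Assumed CORS policy for the example.
        self.send_header("Access-Control-Allow-Origin", "https://example.com")
        self.send_header("Access-Control-Allow-Methods", "POST, OPTIONS")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")

    def do_OPTIONS(self):
        # Preflight path: headers only, the POST logic is never executed.
        self.send_response(204)
        self._send_cors_headers()
        self.end_headers()

    def do_POST(self):
        # POST path: a JSON payload is required here.
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        try:
            status, reply = 200, {"received": json.loads(raw)}
        except ValueError:
            status, reply = 400, {"error": "a JSON payload is required"}
        data = json.dumps(reply).encode()
        self.send_response(status)
        self._send_cors_headers()
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, format, *args):
        # Silence the default per-request logging to stderr.
        pass


def make_server(port=0):
    """Create a server bound to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), ApiHandler)
```

Calling make_server(8000).serve_forever() would run it locally. With this layout an OPTIONS request can never reach the code that expects a payload.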

Throttling HTTP request (Node.js)

I'm trying to do some basic web scraping and eventually found the need to limit my request calls, as the server returns a "page not found" when there are too many requests.
Currently I'm using request-promise wrapped in request-promise-retry to make this call. I also found this article, which seems to be trying to achieve the same thing: "Throttle and queue up API requests due to per second cap".
I went ahead and tried simple-rate-limiter, as it looks simple enough to use, but got the following error:
TypeError: requestPromise(...).catch is not a function
at promiseRetry.retries (C:\project\node_modules\request-promise-retry\index.js:18:27)
at C:\project\node_modules\promise-retry\index.js:29:24
at <anonymous>
My guess is that the simple-rate-limiter doesn't work with request-promise and only works with request.
Is there a simple way to throttle the requests without having to rewrite all my calls with "request" instead of "request-promise"?
Or is it a good idea to rewrite using plain "request"? That would be a pain, as I already have a few pieces of code written with request-promise that expect a promise to be returned.

Request with RabbitMQ NodeJs

I'm pretty new here, so hope I can get some help with a basic doubt which I couldn't get around yet.
I'm using Node.js and I have followed the RabbitMQ "Get Started" tutorials and could understand the flow; however, my doubt is with regard to HTTP requests.
What I need:
Manage http (POST, PUT, GET, DELETE) requests to another server.
What I was expecting:
RabbitMQ manages the request queue, so if some request fails it would be retried. When it is successful, it would call another API on my end to flag that the request was successful.
What is my question:
I couldn't find any example showing how I would set up this request, providing the sender URL, METHOD and PAYLOAD and also the callback URL, METHOD, HEADERS, and PAYLOAD.
Is that something related to RabbitMQ or am I getting it wrong?

Node.js: Should I discard request on error?

When I build a server using Node.js, requests can sometimes fail. For example, there can be an error in parsing POST data. When any error happens:
should I continue handling the request, risking that some of the POST data may be corrupt or missing, and respond as if nothing happened (or respond and notify the user that some error happened)?
try to reparse the POST data (and if it fails, let's say, 3 times, stop trying, add the error to the error log and show an error page to the user)?
stop the request handling immediately and throw a 500 error?
What is the best way?
The key questions to ask yourself in this situation are IMO:
do I know why the error happened;
can I recover from it?
The answers depend solely on your application. Generally, retrying something only makes sense if you can expect a different outcome with the next attempt. This typically applies to various unexpected errors when integrating with external systems. On the other hand, if the error that you get clearly states that it received e.g. a bad request, or that a file does not exist, then this is probably not going to change no matter how many times you retry the same operation.
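
The distinction drawn above can be sketched as follows: retry only the errors that might go away, and let everything else fail immediately. TransientError, PermanentError and call_with_retry are hypothetical names introduced for the example; how an application classifies its real errors is up to its own rules.

```python
import time


class TransientError(Exception):
    """A failure that may not repeat, e.g. a timeout talking to an external system."""


class PermanentError(Exception):
    """A failure that retrying cannot fix, e.g. a bad request or a missing file."""


def call_with_retry(operation, attempts=3, delay=0.0):
    """Run operation(), retrying only when it raises TransientError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)  # wait before the next attempt
        # Any other exception, including PermanentError, propagates at once.
    raise last_error
```

A malformed request body belongs in the permanent category: parsing the same bytes a second time gives the same result.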
If your business rules allow you to continue the operation while ignoring the error entirely, then do so. If your business rules allow you to carry out the request partially and you're able to report the partial failure - then do so. If the error prevents any processing of the request whatsoever, then you'll have to terminate it and report back to the user.
It seems to me that you had a particular situation in mind, so let me address request body re-parsing. In a proper system doing this should be utterly pointless - if you expect different outcomes, then there is something fundamentally wrong with your setup because the body of a request should not change once it's received. Your application should never modify request data in any way.
As a more general rule - if you expect that something unexpected might happen in your application, but you have no idea when, why or how, then there is something wrong with how the application is structured / executed. You should own your code and know exactly what it does.

Where are HTTP Status Codes first available in the IIS Pipeline?

Does anyone have any information on where the HTTP Status Codes (200, 404, 500, etc) are first available in the IIS Pipeline? I'm trying to write a series of http modules and handlers for error handling purposes and I don't particularly want to duplicate requests/responses to get the values.
Take a look at any of the events here, and take your pick :)
http://msdn.microsoft.com/en-us/library/ms693685(v=vs.90).aspx
Theoretically, the status code can be changed by any of the http modules in the pipeline; it just depends on which events they are subscribed to.
For example, an authorization module may subscribe to the OnAuthorizeRequest method, and perform its logic at that time, and change the status code if needed. In another case, a classic ASP app may run as a handler, and you won't be able to determine if the status code is 500 until OnPostExecuteRequestHandler. Finally, an error in a logging module may generate a 500, which doesn't occur until the request processing is nearly finished (OnLogRequest)
Further complicating matters, some handlers may spit out unbuffered data during execution, so it could be in any of the OnSendResponse events, which don't come in any particular order, and the status code could have changed between responses.
So, it really depends on what you're trying to achieve in order to approach this effectively. If you could provide more detail, perhaps we could formulate a solution.
