In a Node.js server that accepts HTTPS POST requests that are typically pretty large (a few MB), we want to be able to start processing each request before the entire thing is accepted by the server.
For example, if a request with a big fat body arrives, we want to look at its path and, based on it, decide whether to terminate/reject it, without having to wait for the entire request to arrive (and pay the IO cost of receiving that fat body).
You could try the Connect Limit middleware:
https://github.com/senchalabs/connect/blob/master/lib/middleware/limit.js
or implement your own solution in a similar way by checking req.headers['content-length'], etc.
Based on experimentation, it seems that Node.js only fires the request event after parsing the HTTP headers, meaning there's a chance to examine the headers before we even start listening for the data event.
Thus the solution seems to be to check the headers before reading any data, and potentially rejecting the request at that point. If we don't reject at that point, we start accumulating the data buffers as they arrive and if they exceed the limit (and thus conflict with the reported content length) we have another chance to reject the request right there by calling response.end().
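A rough sketch of that approach with the plain http module (the /upload path and the 5 MB limit are made up for illustration; req.destroy() is a blunt way to stop receiving the body and may drop the connection hard):

var http = require('http');

var MAX_BYTES = 5 * 1024 * 1024; // hypothetical 5 MB limit

function reject(req, res, status) {
    res.statusCode = status;
    res.end();
    req.destroy(); // stop paying the IO cost of the rest of the body
}

http.createServer(function (req, res) {
    // The 'request' event fires once the headers are parsed, before the body
    // arrives, so we can reject on path or Content-Length without reading data.
    if (req.url !== '/upload') return reject(req, res, 404);

    var declared = parseInt(req.headers['content-length'], 10);
    if (declared > MAX_BYTES) return reject(req, res, 413); // Payload Too Large

    var received = 0;
    var chunks = [];
    req.on('data', function (chunk) {
        received += chunk.length;
        // Second chance: the body exceeds the limit (or lied in Content-Length).
        if (received > MAX_BYTES) return reject(req, res, 413);
        chunks.push(chunk);
    });
    req.on('end', function () {
        // ... process Buffer.concat(chunks) ...
        res.end('ok');
    });
}).listen(8080);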
Related
I was doing some Node.js work and I ran into a scenario in which I had to use POST requests. I saw that Node deals with POST requests in a slightly different manner than GET requests. In the case of POST requests we need to create two event listeners, on('data', ...) and on('end', ...). In the case of GET requests, I found no such complication. All of this led me to believe that maybe GET requests are always guaranteed to be sent within one chunk of data from the client, whereas POST requests can be sent over multiple chunks. Am I correct, or is there a flaw in my understanding? Please correct me if so.
GET requests don't usually have a "body" as part of them, so once you've read the HTTP request headers, you have everything; there is no need for additional code to read more.
POST requests, on the other hand, usually do have a body, so once you've gotten the headers, you then need to read the body.
FYI, TCP is a streaming protocol which means there are no guarantees about what chunks data will arrive in. Even the headers themselves could arrive in multiple packets. But, the http library you're using already takes care of that for you. It reads data until it has all the headers. Reading the body of a POST request is more up to you to do unless you use some sort of body-parser middleware which will read the body for you.
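A minimal sketch of that difference with the plain http module (the response bodies are placeholders):

var http = require('http');

http.createServer(function (req, res) {
    if (req.method === 'GET') {
        // The parsed headers are all there is; nothing more to read.
        res.end('hello');
        return;
    }
    // POST: the body may arrive in any number of 'data' chunks.
    var body = '';
    req.on('data', function (chunk) { body += chunk; });
    req.on('end', function () {
        res.end('received ' + body.length + ' bytes');
    });
}).listen(8080);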
So I have been reading about flow based programming (FBP) in the last few days and I have also been reading J. Paul Morrison's book about it. However I feel I still can't really wrap my head around it. The general concept is that you see programming as some sort of assembly line where you have components that take some packet as input and produce some packets as output. You can connect these components and packets travel through the network. While I totally see how this can work for ETL type applications or batch processing, I have no good idea how you could handle things like synchronous request/response patterns or database transactions with it.
For example, let's say I have a web server implemented as FBP. This web server has a GET /user/{id} endpoint which should return JSON with some information about a user. It also has a POST /user/{id} endpoint where you can update the user by sending some JSON back to the server. So here is how I would imagine this flow looking:
I tried to have many re-usable components instead of putting the whole logic of handling a request into a single component. So there is a HTTP server component which sends out requests to a dispatcher component which then dispatches the requests into subsequent flows. In each flow the request is parsed by a generic "Request parser" component which outputs various parts of the request into the rest of the flow.
The upper part is quite straightforward: I read the entity of the user with the given ID from the DB, serialize the object to JSON and then send it back. However, at this point we don't really have a reference to the HTTP request anymore, so how would I know where to send this response?
On the lower part we have some additional complexity because I would like to write to the database in a transactional way. So first a transaction is started (in parallel the request body is parsed into some object), then the user object is retrieved from the database and merged with the inputs from the request. At the end it is written back to the database and the transaction is committed. Finally some "OK" status is responded to the caller. Here I have the additional problem that when committing the transaction I really don't know which transaction to commit. And of course when sending the response I don't know which request to send it to.
So both problems seem to have something in common: a kind of "context" that spans many components. In one example it is an HTTP request/response context, in the other a transactional context. In regular programming, these contexts are usually handled at the thread level. Since a request runs in a single thread, the transaction and request contexts are bound to a thread-local so they can be accessed everywhere, as long as everything is running in the same thread.
In flow based programming, every component runs independently and ideally on separate threads. This is actually a key thing because it allows for parallelization and effective use of multiple processors. However when that thread-local context is no longer there, how can you handle these problems in flow based programming? This would get even more complicated with proper error handling (which I left out in my example).
I figure that when you do reactive-style programming, where most of the processing is asynchronous and multithreaded as well, you will have the same issues, so I wonder whether there are patterns to handle this. Do you have real-life experience with either reactive-style programming or flow-based programming, and do you have hints on how I could solve this problem?
I wrote a quick answer on Twitter - thought I would post it here as well... Apologies for double-posting!
I like substreams for this/these problem(s), where the first Information Packet in the substream provides the "context" you were talking about. This may help: https://github.com/jpaulm/javafbp-websockets... HTH!
PS This loop-style network topology is also the basis of Facebook's new "Flux" technology - see Jing Chen's presentation, in which she compares this approach with MVC: https://www.youtube.com/watch?v=nYkdrAPrdcw
Hopefully this may nudge you in the right direction. I had a similar issue where I needed to perform a synchronous operation in an asynchronous microservice architecture.
How I solved it was using the Observer pattern. I have three components: an HTTP server, a callback server and a timer wheel. The HTTP server, similar to yours, receives the incoming request; the callback server receives the overall result after asynchronous processing; and the timer wheel queues the original HTTP context and reconciles the response to the HTTP request.
When an incoming request is received, the HTTP server creates a correlation ID, appends it to the request metadata, appends the callback server URL to the request metadata, and finally adds the request and the original HTTP context together into the timer wheel. Then the HTTP server passes the request to the dispatcher, as in your case, and sends messages to the relevant components for asynchronous processing.
Depending on the outcome of execution, the current processing component retrieves the callback URL from the metadata and sends the response to the callback server. In your case it's the JSON serialization or the database write that would do this. The callback server then extracts the correlation ID that was appended, gets the corresponding HTTP context and writes the response.
NB: each timer object in the timer wheel has a configurable timeout; that way, if the asynchronous processing is delayed, it times out and returns a configurable message to the HTTP client of the corresponding HTTP context.
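A minimal Node.js sketch of that correlation idea, with a plain Map plus per-entry timeouts standing in for a real timer wheel (the Express-style res, the function names, and the 504 timeout response are all illustrative assumptions, not the actual implementation described above):

var crypto = require('crypto');

var pending = new Map(); // correlation ID -> { res, timer } (the queued HTTP context)

function registerRequest(res, timeoutMs) {
    var id = crypto.randomUUID();
    var timer = setTimeout(function () {
        pending.delete(id);
        res.status(504).json({ error: 'processing timed out' }); // configurable timeout message
    }, timeoutMs);
    pending.set(id, { res: res, timer: timer });
    return id; // append this ID to the request metadata before dispatching
}

// The callback server calls this when the overall result arrives.
function completeRequest(id, result) {
    var entry = pending.get(id);
    if (!entry) return; // already timed out; nothing to reconcile
    clearTimeout(entry.timer);
    pending.delete(id);
    entry.res.json(result); // write the response to the original HTTP context
}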
I'm writing an API that returns an array of redirects for any given page:
var express = require('express'); // assumed setup: these requires were omitted
var request = require('request'); // from the original snippet; body-parsing
var router = express.Router();    // middleware is assumed to be installed upstream

router.post('/trace', function (req, res) {
    if (!req.body.link)
        return res.status(400).send(""); // error: no link provided (400 Bad Request, not 405)
    console.log("\tapi/trace()", req.body.link);
    var redirects = [];
    function exit(goodbye) {
        if (goodbye)
            console.log(goodbye);
        res.status(200).send(JSON.stringify(redirects)); // end
    }
    function getRedirect(link) {
        request({ url: link, followRedirect: false }, function (err, response, body) {
            if (err)
                exit(err);
            else if (response.headers.location) { // got a redirect: record it and follow it
                redirects.push(response.headers.location);
                getRedirect(response.headers.location);
            }
            else
                exit(); // all done!
        });
    }
    getRedirect(req.body.link);
});
and here is the corresponding browser request:
$.post('/api/trace', { link: l }, cb);
A page will make about 1000 POST requests very quickly and then wait a very long time to get each response back.
The problem is that the response to the nth request is very slow. An individual request takes about half a second, but as best I can tell, the Express server is processing each link sequentially. I want the server to make all the requests and respond as it receives each response.
Am I correct in assuming the Express POST router runs processes sequentially? How do I get it to blast out all the requests and pass the responses back as it gets them?
My question is: why is it so slow, and is POST an async process on an "out of the box" Express server?
You may be surprised to find out that this is probably first a browser issue, not a node.js issue.
A browser will have a max number of simultaneous requests it will allow your JavaScript ajax to make to the same host; this varies slightly from one browser to the next, but is around 6. So, if you're making 1000 requests, only around 6 are being sent at a time. The rest go into a queue in the browser, waiting for prior requests to finish. So, your node server likely isn't getting 1000 simultaneous requests. You should be able to confirm this by logging incoming requests in your node.js app. You will probably see a long delay before it receives the 1000th request (because it's queued by the browser).
Here's a run-down of how many simultaneous requests to a given host each of the browsers supported (as of a couple of years ago): Max parallel http connections in a browser?.
My first recommendation would be to package up an array of requests to make from the client to the server (perhaps 50 at a time) and then send that in one request. That will give your node.js server plenty to chew on and won't run afoul of the browser's connection limit to the same host.
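For instance (a hedged sketch: links is assumed to be the array of ~1000 URLs, and the /api/trace-batch route and the traceRedirects helper, which would wrap the redirect-following logic from the question, are hypothetical names):

// Client: ~20 requests of 50 links each instead of 1000 single-link requests.
var BATCH = 50;
for (var i = 0; i < links.length; i += BATCH) {
    $.post('/api/trace-batch', { links: links.slice(i, i + BATCH) }, cb);
}

// Server: trace every link in the batch concurrently, answer once all are done.
router.post('/trace-batch', function (req, res) {
    var results = {};
    var remaining = req.body.links.length;
    req.body.links.forEach(function (link) {
        traceRedirects(link, function (redirects) {
            results[link] = redirects;
            if (--remaining === 0) res.json(results); // last one in sends the reply
        });
    });
});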
As for the node.js server, it depends a lot on what you're doing. If most of what you're doing in the node.js server is just networking and not a lot of processing that requires CPU cycles, then node.js is very efficient at handling lots and lots of simultaneous requests. If you start engaging a bunch of CPU (processing or preparing results), then you may benefit from either adding worker processes or using node.js clustering. In your case, you may want to use worker processes. You can examine your CPU load while your node.js server is processing a bunch of work and see whether the one CPU that node.js is using is anywhere near 100%. If it isn't, then you don't need more node.js processes. If it is, then you do need to spread the work over more node.js processes to go faster.
In your specific case, it looks like you're really only doing networking to collect 302 redirect responses. Your single node.js process should be able to handle a lot of those requests very efficiently so probably the issue is just that your client is being throttled by the browser.
If you want to send a lot of requests to the server (so it can get to work on as many as feasible), but want to get results back immediately as they become available, that's a little more work.
One scheme that could work is to open a webSocket or socket.io connection. You can then send a giant array of URLs that you want the server to check for you in one message over the socket.io connection. Then, as the server gets a result, it can send back each individual result (tagged with the URL that it corresponds to). That way, you can somewhat get the best of both worlds with the server crunching on a long list of URLs, but able to send back individual responses as soon as it gets them.
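A rough sketch of that scheme (socket.io v2-style setup; the event names and the hypothetical traceRedirects helper are made up for illustration):

var io = require('socket.io')(httpServer);

io.on('connection', function (socket) {
    // One message carries the whole list of URLs to check.
    socket.on('trace-urls', function (urls) {
        urls.forEach(function (url) {
            traceRedirects(url, function (redirects) {
                // Tag each result with its URL so the client can match it up.
                socket.emit('trace-result', { url: url, redirects: redirects });
            });
        });
    });
});

// Client side:
// socket.emit('trace-urls', giantArrayOfUrls);
// socket.on('trace-result', function (r) { render(r.url, r.redirects); });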
Note, you will probably find that there is an upper limit to how many outbound http requests you may want to run at the same time from your node.js server too. While modern versions of node.js don't throttle you like the browser does, you probably also don't want your node.js server attempting to run 10,000 simultaneous requests because you may exhaust some sort of network resource pool. So, once you get past the client bottleneck, you will want to test your server at different levels of simultaneous requests open to see where it performs best. This is both to optimize its performance, but also to protect your server against attempting to overextend its use of networking or memory resources and get into error conditions.
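One simple way to cap simultaneous outbound requests (the limit of 100 is a placeholder; measure to find where your server performs best):

function makeLimiter(max) {
    var active = 0;
    var queue = [];
    function next() {
        if (active >= max || queue.length === 0) return;
        active++;
        var task = queue.shift();
        task(function done() { active--; next(); }); // run one more when a slot frees up
    }
    return function run(task) { queue.push(task); next(); };
}

var limited = makeLimiter(100);
urls.forEach(function (url) {
    limited(function (done) {
        traceRedirects(url, function (redirects) { /* handle result */ done(); });
    });
});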
To respond to an HTTP request, we can just use return "content" in the handler function. But for some mission-critical use cases, I would like to make sure the HTTP 200 OK response was actually delivered. Any idea?
The HTTP protocol doesn't work that way. If you need an acknowledgement then you need the client to send the acknowledgement to you.
Or you could look at implementing a bidirectional socket (socket.io is one such library) where the client can send the ACK. If it is mission critical, then don't run it over plain HTTP; use WebSockets.
You can also use AJAX callbacks to gather the acknowledgment. One way of building such a solution would be a UUID generated for every request and returned as part of a header:
$ curl -v http://domain/url
....
response:
X-ACK-Token: 89080-3e432423-234234-23-42323
and then the client makes a call again:
$ curl http://domain/ack/89080-3e432423-234234-23-42323
So the server would know that the given response has been acknowledged by the client. But you cannot enforce an automatic ACK; it is still on the client to send it, and if they don't, you have no way of knowing.
PS: The UUID here is not an actual UUID, just a random number shared for the example.
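A hedged Express sketch of that token scheme (the routes, the header name, and the in-memory Map are illustrative; a real deployment would persist and expire tokens):

var express = require('express');
var crypto = require('crypto');
var app = express();

var unacked = new Map(); // token -> metadata about what was sent

app.get('/url', function (req, res) {
    var token = crypto.randomUUID();
    unacked.set(token, { sentAt: Date.now() });
    res.set('X-ACK-Token', token);
    res.send('content');
});

app.get('/ack/:token', function (req, res) {
    if (unacked.delete(req.params.token)) {
        res.sendStatus(200); // delivery confirmed by the client
    } else {
        res.sendStatus(404); // unknown or already acknowledged token
    }
});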
Take a look at Microsoft's asynchronous server socket.
An asynchronous server socket requires a method to begin accepting connection requests from the network, a callback method to handle the connection requests and begin receiving data from the network, and a callback method to end receiving the data (this is where your client could respond with the success or failure of the HTTP request that was made).
Example
It is not possible with HTTP. If for some reason you can't use sockets because your implementation requires HTTP (like an API), you must agree on a timeout strategy with your client.
It depends on how many cases you want to handle, but for example you can state something like this:
1. The client generates an internal identifier and sends the HTTP request including that "ClientID" (like a timestamp or a random number), either in the headers or as a body parameter.
2. The server responds 200 OK (or an error, it does not matter).
3. The client waits 60 seconds for the server's answer (you define your maximum timeout).
   - If it receives the response, it handles it and finishes.
   - If it does NOT receive the answer, it tries again after the timeout, including the same "ClientID" generated in step 1.
4. The server detects that the "ClientID" was already received. Either it returns 409 Conflict, informing that it "already exists" (and the client should know how to handle it), or it just returns 200 OK, and the client never knows that it was received the first time.
Again, this depends a lot on your business / technical requirements, because you could even get two or more consecutive timeout loops.
Hope you get an idea.
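A minimal sketch of the server side of that scheme (Express-style; the /operation route, the X-Client-ID header, and the in-memory Set are assumptions, and real code would expire old IDs rather than grow the Set forever):

var seen = new Set(); // ClientIDs already processed

app.post('/operation', function (req, res) {
    var clientId = req.get('X-Client-ID');
    if (!clientId) return res.status(400).send('missing X-Client-ID');
    if (seen.has(clientId)) {
        // Step 4 above: either 409 Conflict, or just 200 OK again.
        return res.status(409).send('Already exists');
    }
    seen.add(clientId);
    // ... perform the operation exactly once ...
    res.sendStatus(200);
});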
As #tarun-lalwani already wrote, the HTTP protocol is not designed for that. What you can do is have the app create a file, and your program then checks, after the 200 response, the existence and timestamp of the remote file. This has the implication that every 200 response requires another request to check the file.
I want to write a callback that takes a bit of time to complete an external IO operation, but I do not want it to interfere with sending data back to the client. I don't care about waiting for callback completion for purposes of the reply back to the client, but if the callback results in an error, I would like to log it. About 80% of executions will result in this callback executing after the response is sent back to the client and the connection is closed.
My approach works well and I have not seen any problems, but I would like to know whether there are any pitfalls in this approach that I may be unaware of. I would think that node's evented IO would handle this without issue, but I want to make sure before I commit this architecture to production. Any issues that should make me reconsider this approach?
As long as you're not trying to reference that response object after the response is sent, this will not cause any problems. There's nothing special about a request handler that cares one bit about callbacks in its code being invoked after the response is generated.
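A small sketch of the pattern being described (Express-style; slowExternalIo is a hypothetical stand-in for the external IO operation):

app.post('/submit', function (req, res) {
    res.status(202).send('accepted'); // the client's reply is done here

    slowExternalIo(req.body, function (err) {
        // Runs after the response is (usually) already sent; never touch `res` here.
        if (err) console.error('background IO failed:', err);
    });
});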