I'd like to provide an endpoint in my API to allow third parties to send large batches of JSON data. I'm free to define the format of the JSON objects, but my initial thought is a simple array of objects:
[{"id":1, "name":"Larry"}, {"id":2, "name":"Curly"}, {"id":3, "name":"Moe"}]
As there could be any number of these objects in the array, I'd need to stream this data in, read each of these objects as they're streamed in, and persist them somewhere.
TL;DR: Stream a large array of JSON objects from the body of an Express POST request.
It's easy to get the most basic examples out there working, as all of them seem to demonstrate this idea using "fs" and working with the filesystem.
What I've been struggling with is the Express implementation of this. At this point, I think I've got this working using the "stream-json" package:
const express = require("express");
const router = express.Router();
const StreamArray = require("stream-json/streamers/StreamArray");

router.post("/filestream", (req, res, next) => {
  const stream = StreamArray.withParser();
  req.pipe(stream).on("data", ({key, value}) => {
    // Each element of the posted array arrives here as { key, value }
    console.log(key, value);
  }).on("finish", () => {
    console.log("FINISH!");
  }).on("error", e => {
    console.log("Stream error :(");
  });
  // Responds immediately; parsing continues after the response is sent
  res.status(200).send("Finished successfully!");
});
I end up with a proper readout of each object as it's parsed by stream-json. The problem seems to be that the event loop gets blocked while the processing is happening: I can hit this endpoint once and immediately get the 200 response, but a second hit blocks until the first batch finishes processing, at which point the second also begins.
Is there any way to do something like this without spawning a child process or anything like that? I'm unsure how to structure this so that the endpoint can continue to receive requests while streaming/parsing the individual JSON objects.
In one of the older projects I saw two ways of handling form data.
One way is done using EventEmitter methods, like this:
const http = require("http");
const { StringDecoder } = require("string_decoder");

http.createServer(function(req, res){
  const decoder = new StringDecoder("utf-8");
  let buffer = "";

  req.on("data", function(chunk){
    buffer += decoder.write(chunk);
  });

  req.on("end", function(){
    buffer += decoder.end();
    // Logic
  });
});
The second way (the Express way) of doing this is getting params from the request's body:
// Note: req.body is only populated when a body-parsing middleware
// such as app.use(express.json()) is registered
app.post('/', function(req, res){
  const name = req.body.name;
});
As far as I understand, if the posted data is small we can fetch it from the body, and if the posted data is large we can switch to buffering the raw stream ourselves.
Is there any other good explanation for this?
Express is just a lib that wraps the vanilla HTTP lib; if you read the expressjs source code, it's still the Node HTTP lib under the hood. When the body is too big to buffer comfortably, streams can be used to process the data continuously and efficiently.
I'm going to use the following code as an example to frame my question. It's basically just the code required to pull a list of to-dos from an SQLite3 database:
So, there's an axios request in the front end:
useEffect(() => {
axios.get('http://localhost:3001/todo', {})
.then(res => {
setTodoList(res.data)
})
.catch(err => {
console.log(err)
})
}, [])
...which links to the following function in the back end:
server.get('/todo', (req,res) => {
// res.json(testData)
const todos = db('todos') //this is shorthand for 'return everything from the table 'todos''
.then(todos => {
return res.json(todos)
})
})
...the data from this GET request is then rendered within a React component, as a list of text.
I'm just confused about the flow of data - when is it HTTP, when is it JSON, what form does the data come out of the database as, and how is it that these different protocol/languages can talk to each other?
I get the overall principle of a GET request and async functions, I just don't get what's going on under the hood. Thanks!
That's a lot of questions about basic issues. But here are some answers. Firstly, you can simplify the server function as:
server.get('/todo', (req, res) => {
db('todos').then(todos => res.json(todos));
});
The data from the db is a JavaScript array by the time you are dealing with it in Express. res.json converts it into JSON, which is, of course, just a string.
Express creates an HTTP response, which consists of some headers (key/value pairs such as Content-Length and so on) followed by a body, which in your case is just a JSON blob, a string. That response is sent over the network via HTTP.
The browser receives the response, and axios is kind enough to handle the grunt work of reading the headers and turning your JSON back into a JavaScript array/object which can then be handled inside React.
The part I can't answer is "how is it that these different protocol/languages can talk to each other", because that is very complex and the question is not well defined. There are many network layers involved.
My current setup involves a Node.js web application using Express.js.
I am using DataDog's dd-tracer to measure the time Node.js spends for particular method invocations as part of my APM solution.
I would like to know if it is possible to measure the portion of time an incoming HTTP request is busy sending data back to the client as HTTP response body.
Are there any pitfalls or inaccuracies involved when trying to do this kind of instrumentation?
Does anybody know why this is not measured by APM client libraries by default?
I would like to know if it is possible to measure the portion of time an incoming HTTP request is busy sending data back to the client as HTTP response body.
You could wrap calls to res.write manually to create additional spans in the request trace. I would only recommend this if there are not many calls to the method within a request, and otherwise I would recommend to capture just a metric instead.
Alternatively, profiling might be an option that would give you a lot more information about exactly what is taking time within the res.write calls.
I'm looking for a "global" solution that can be integrated into a Nest.js application without instrumenting each call to res.write manually.
As described above, you can simply wrap res.write directly at the start of every request. Using the tracer, this can be achieved like this:
res.write = tracer.wrap('http.write', res.write)
This should be done before any other middleware has the chance to write data.
Example middleware:
app.use((req, res, next) => {
  res.write = tracer.wrap('http.write', res.write)
  next() // pass control on, or every request will hang here
})
Are there any pitfalls or inaccuracies involved when trying to do this kind of instrumentation?
Nothing major that I can think of.
Does anybody know why this is not measured by APM client libraries by default?
The main issue for doing this out of the box is that creating a span for every call to res.write may be expensive if there are too many calls. If you think it would make sense to have an option to do this out of the box, we can definitely consider adding that.
Hope this helps!
It depends on whether you want the response time of each individual call or aggregated statistics about the response times.
For the first, to get the response time in the header of the response for each request, you can use response-time package: https://github.com/expressjs/response-time
This will add a value (by default X-Response-Time) to the response header, containing the elapsed time from when a request enters the middleware to when the headers are written out.
var express = require('express')
var responseTime = require('response-time')

var app = express()
app.use(responseTime())

app.get('/', function (req, res) {
  res.send('hello, world!')
})

app.listen(3000)
If you want a more complete solution that gathers statistics including the response time, you can use the express-node-metrics package: https://www.npmjs.com/package/express-node-metrics
var express = require('express');
var metricsMiddleware = require('express-node-metrics').middleware;

var app = express();
app.use(metricsMiddleware);

app.get('/users', function(req, res, next) {
  // Do something, then respond
  res.send('ok');
})

app.listen(3000);
You can expose and access these statistics like this:
'use strict'
var express = require("express");
var router = express.Router();
var metrics = require('express-node-metrics').metrics;

router.get('/', function (req, res) {
  res.send(metrics.getAll(req.query.reset));
});
router.get('/process', function (req, res) {
  res.send(metrics.processMetrics(req.query.reset));
});
router.get('/internal', function (req, res) {
  res.send(metrics.internalMetrics(req.query.reset));
});
router.get('/api', function (req, res) {
  res.send(metrics.apiMetrics(req.query.reset));
});

module.exports = router;
First of all, I admit I don't know dd-tracer, but I can try to provide a way to get the requested time; it's then up to the developer to use it as needed.
The main inaccuracy that comes to mind is that every OS has its own TCP stack, and writing to a TCP socket is a buffered operation: for response bodies smaller than the OS TCP buffer we will probably measure a time close to 0. The result is also influenced by the Node.js event loop load, and the larger the response body becomes, the more negligible that event-loop-related time becomes. So, if we measure the write time for all requests but restrict our analysis to long-running ones, I think the measurement will be quite accurate.
Another possible source of inaccuracy is how the request handlers write their output: if a handler writes part of the body, then performs a long-running operation to compute the last part, then writes the missing part, the measured time is inflated by that computation. We should take care that all request handlers write headers and body all at once.
My proposed solution (which works only if the server does not implement keep-alive) is to add a middleware like this:
app.use((req, res, next) => {
let start;
const { write } = res.socket;
// Wrap only first write call
// Do not use arrow function to get access to arguments
res.socket.write = function() {
// Immediately restore write property to not wrap next calls
res.socket.write = write;
// Take the start time
start = new Date().getTime();
// Actually call first write
write.apply(res.socket, arguments);
};
res.socket.on("close", () => {
// Take the elapsed time in result
const result = new Date().getTime() - start;
// Handle the result as needed
console.log("elapsed", result);
});
next();
});
Hope this helps.
You can start a timer before res.end, and since any code after res.end should run once it has finished, stop the timer right after the res.end call. Don't quote me on that, though.
I'm new to Node/Express. I have a long-running series of processes, for example: post to Express endpoint -> save data (can return now) -> handle data -> handle data -> handle data -> another process -> etc.
A typical POST:
app.post("/foo", (req, res) => {
// save data and return
return res.send("200");
// but now I want to do a lot more stuff...
});
If I omit the return then more processing will occur, but even though I'm a newbie to this stack, I can tell that's a bad idea.
All I want is to receive some data, save it and return. Then I want to start processing it, and call into other processes, which call into other processes, etc. I don't want the original POST to wait for all this to complete.
I need to do this in-process, so I can't save to a queue and process it separately afterwards.
Basically I want to DECOUPLE the receipt and processing of the data, in process.
What options are available using Node/Express?
I'd try something like this:
const express = require("express");
const uuid = require('uuid');

const port = 3000;
const app = express();
app.use(express.json()); // parse JSON request bodies

app.post("/foo", (req, res) => {
  const requestId = uuid.v4();
  // Send result. Set status to 202: The request has been accepted for processing, but the processing has not been completed. See https://tools.ietf.org/html/rfc7231#section-6.3.3.
  res.status(202).json({ status: "Processing data..", requestId: requestId });
  // Process request.
  processRequest(requestId, req.body);
});

app.get("/fooStatus", (req, res) => {
  // Check the status of the request (a query parameter, since GET requests have no body).
  let requestId = req.query.requestId;
  // Look up the stored status for requestId and respond with it here.
});

function processRequest(requestId, data) {
  /* Process data here, then perhaps save result to db. */
}

app.listen(port);
console.log(`Serving at http://localhost:${port}`);
Calling this with curl (for example):
curl -v -X POST http://localhost:3000/foo
Would give a response like:
{"status":"Processing data..","requestId":"abbf6a8e-675f-44c1-8cdd-82c500cbbb5e"}
There is absolutely nothing wrong with your approach of removing the return here and ending the request, so long as you don't have any other code that tries to send any data back later on.
I'd recommend returning status code 202 Accepted for these long-running scenarios though; it indicates to the consumer that the server has accepted the request but hasn't finished processing it.
I'm using a proxy middleware to forward multipart data to a different endpoint. I would like to get some information from the stream using previous middleware, and still have the stream readable for the proxy middleware that follows. Is there stream pattern that allows me to do this?
function preMiddleware(req, res, next) {
req.rawBody = '';
req.on('data', function(chunk) {
req.rawBody += chunk;
});
req.on('end', () => {
next();
})
}
function proxyMiddleware(req, res, next) {
console.log(req.rawBody)
console.log(req.readable) // false
}
app.use('/cfs', preMiddleware, proxyMiddleware)
I want to access the name value of <input name="fee" type='file' /> before sending the streamed data to the external endpoint. I think I need to do this because the endpoint parses fee into the final URL, and I would like to have a handle for doing some post-processing. I'm open to alternative patterns to resolve this.
I don't think there is any mechanism for peeking into a stream without permanently removing data from it, nor any mechanism for "unreading" data to put it back into the stream.
As such, I can think of a few possible ideas:
1. Read the data you want from the stream and then send the data to the final endpoint manually (not using your proxy code that expects a readable stream).
2. Read the stream, get the data you want out of it, then create a new readable stream, put the data you read into that readable stream, and pass that readable stream on to the proxy. Exactly how to pass it on to the proxy will need some looking into the proxy code; you might have to make a new req object that is the new stream.
3. Create a stream transform that lets you read the stream (potentially even modifying it) while creating a new stream that can be fed to the proxy.
4. Register your own data event handler, then pause the stream (registering a data event handler automatically triggers the stream to flow, and you don't want it to flow yet), then call next() right away. I think this will allow you to "see" a copy of all the data as it goes by when the proxy middleware reads the stream, as there will just be multiple data event handlers, one for your middleware and one for the proxy middleware. This is a theoretical idea that I haven't tried; see the sketch below.
You would need to be able to send a single stream in two different directions, which is not gonna be easy if you try it on your own. Luckily, I wrote a helpful module back in the day, rereadable-stream, that you could use, and I'll use scramjet for finding the data you're interested in. I assume your data will be multipart, delimited by a boundary:
const {StringStream} = require('scramjet');
const {ReReadable} = require("rereadable-stream");

// I will use a single middleware, since express does not allow passing an altered request object to next()
app.use('/cfs', (req, res, next) => {
  const buffered = req.pipe(new ReReadable()); // buffer the request so it can be rewound later
  let file = '';
  buffered.pipe(new StringStream) // pipe to a StringStream
    .lines('\n') // split request by line
    .filter(x => x.startsWith('Content-Disposition: form-data;'))
    // find form-data lines
    .parse(x => x.split(/;\s*/).reduce((a, y) => { // split values
      const z = y.split(/:\s*/); // split value name from value
      a[z[0]] = JSON.parse(z[1]); // assign to accumulator (values are quoted)
      return a;
    }, {}))
    .until(x => x.name === 'fee' && (file = x.filename, 1))
    // run the stream until the filename is found
    .run()
    .then(() => uploadFileToProxy(file, buffered.rewind(), res, next))
    // upload the file using your method
});
You'll probably need to adapt this a little to make it work in a real-world scenario. Let me know if you get stuck, or if there's something to fix in the above answer.