Processing a lot of requests without crashing - node.js

I am trying to send a lot of HTTPS requests, and processing them causes my code to crash. I know I can increase the process's memory, but that won't scale. Uncommenting the code below causes an OOM crash at some point. The solution should probably be to flush some buffer or something, but I am learning Node.js so I'm not sure what to do.
var https = require('https');

// url anonymized for example
var urlArray = ["https://example.com/blah", ....] // 5000 urls here

var options = {
    headers: { "x-api-key": "mykey" }
};

for (let dest of urlArray) {
    https.request(dest, options, (res) => {
        if (res.statusCode != 200) {
            console.log(res.statusCode + " " + res.statusMessage + " at " + dest)
        }
    })
    // uncommenting below causes a crash during runtime
    // .on("error", (err) => console.log(err.stack))
    .end();
}

Node.js being non-blocking, it does not wait for the ith https.request to finish before moving on to the (i+1)th. So the requests keep accumulating in memory, and since memory is not big enough, the process crashes. What we can do here is execute the requests in batches and wait for each batch to finish before starting the next. With this, at any instant there will be at most n requests in memory (where n is the batch size).
The code will look something like this:
const request = require('request-promise');

const urlArray = ["https://example.com/blah", ....];

async function batchProcess(urlArray) {
    const batchSize = 100;
    let i = 0;
    while (i < urlArray.length) {
        let batch = [];
        for (let j = 0; j < batchSize && i < urlArray.length; j++, i++) {
            batch.push(request({
                uri: urlArray[i],
                headers: {
                    'User-Agent': 'Request-Promise',
                    "x-api-key": "mykey"
                },
                json: true // Automatically parses the JSON string in the response
            }));
        }
        let batchResult = await Promise.all(batch);
        console.log(batchResult);
    }
}

batchProcess(urlArray);
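The `request-promise` package has since been deprecated; the same batching pattern can be sketched dependency-free with the global `fetch` available in Node 18+ (a sketch, not the answer's original code; `Promise.allSettled` is used so one failed URL does not reject the whole batch):

```javascript
// Process urls in batches of `batchSize`; at most `batchSize`
// requests are in flight at any instant.
async function batchProcess(urls, batchSize = 100) {
  const results = [];
  for (let i = 0; i < urls.length; i += batchSize) {
    const batch = urls
      .slice(i, i + batchSize)
      .map((url) => fetch(url, { headers: { "x-api-key": "mykey" } }).then((res) => res.json()));
    // Wait for the whole batch before starting the next one.
    results.push(...(await Promise.allSettled(batch)));
  }
  return results;
}
```

Each entry in the returned array is a `{ status, value }` or `{ status, reason }` record, so failures can be inspected per URL afterwards.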

One way would be to turn them into an async iterable where you can run them one after another and process them as they return (apologies for the TypeScript; just run it through the TypeScript playground to transpile if you don't know TS):
import fetch from "node-fetch";

class MyParallelCalls implements AsyncIterable<any> {
    constructor(private readonly urls: string[]) {}

    [Symbol.asyncIterator](): AsyncIterator<any> {
        return this.iterator();
    }

    async *iterator(): AsyncGenerator<any> {
        for (const url of this.urls) {
            yield (await fetch(url, { headers: { "x-api-key": "mykey" } })).json();
        }
    }
}
async function processAll() {
    const calls = new MyParallelCalls(urls);
    for await (const call of calls) {
        // deal with them as they happen: e.g. pipe them into a process or a destination
        console.log(call);
    }
}

processAll();
If you want, I can modify the above to batch your calls too. It's easy: just add an option to the constructor for batch size, set how many calls you want per batch, and use Promise.all when yielding.
It will look something like below (refactored a little so it's more generic):
import fetch from "node-fetch";

interface MyParallelCallsOptions {
    urls: string[];
    batchSize: number;
    requestInit: RequestInit;
}

class MyParallelCalls<T> implements AsyncIterable<T[]> {
    private currentIndex = 0;

    constructor(private readonly options: MyParallelCallsOptions) {}

    [Symbol.asyncIterator](): AsyncIterator<T[]> {
        return this.iterator();
    }

    private getBatch(): string[] {
        const batch = this.options.urls.slice(this.currentIndex, this.currentIndex + this.options.batchSize);
        this.setNextBatch();
        return batch;
    }

    private setNextBatch(): void {
        this.currentIndex += this.options.batchSize;
    }

    private isLastBatch(): boolean {
        // >= (not ===) so iteration also stops when the last batch is
        // smaller than batchSize; === would loop forever in that case
        return this.currentIndex >= this.options.urls.length;
    }

    async *iterator(): AsyncGenerator<T[]> {
        while (!this.isLastBatch()) {
            const batch = this.getBatch();
            const requests = batch.map(async (url) => (await fetch(url, this.options.requestInit)).json());
            yield Promise.all(requests);
        }
    }
}

async function processAll() {
    const batches = new MyParallelCalls<any>({
        urls, // string array of your urls
        batchSize: 5,
        requestInit: { headers: { "x-api-key": "mykey" } }
    });
    for await (const batch of batches) {
        console.log(batch);
    }
}

processAll();

Related

Node JS - Increasing latency for get requests inside a map method

I have a fairly straightforward application which 1) accepts an array of urls 2) iterates over those urls 3) makes a get request to stream media (video / audio). Below is a code snippet illustrating what I'm doing
import request from 'request';

const tasks = urlsArray.map(async (url: string) => {
    const startTime = process.hrtime();
    let body = '';
    let headers = {};
    try {
        const response = await promisify(request).call(request, {
            url, method: 'GET', encoding: null, timeout: 10000
        });
        body = response.body;
        headers = response.headers;
    } catch (e) {
        logger.warn('failed to make request');
    }
    const [seconds] = process.hrtime(startTime);
    logger.info(`Took ${seconds} seconds to make request`);
    return body;
});

(await Promise.all(tasks)).forEach((body) => {
    // processing body...
});
What I'm currently experiencing is that the time to make the request keeps rising as I make more requests and I'm struggling to understand why that is the case.

download <title> of a url with minimal data usage

For the purpose of generating links to other websites I need to download the content of the <title> tag.
But I would like to use as little bandwidth as possible.
In some hardcore variant, process the input stream and close the connection when </title> is reached.
Or, e.g., fetch the first 1024 chars on the first attempt and, when that did not contain the whole title, fetch the whole thing as a fallback.
What could I use in Node.js to achieve this?
In case someone else is interested, here is what I ended up with (a very initial version; use it just to get a notion of what to use).
A few notes though:
the core of this solution is to read as few chunks of the http response as possible, until the </title> is reached.
not sure how friendly it is, by convention, to interrupt the connection with response.destroy() (which probably closes the underlying socket).
import {get as httpGet, IncomingMessage} from 'http';
import {get as httpsGet} from 'https';
import {titleParser} from '../ParseTitleFromHtml';

async function parseFromStream(response: IncomingMessage): Promise<string|null> {
    let body = '';
    for await (const chunk of response) {
        const text = chunk.toString();
        body += text;
        const title = titleParser.parse(body);
        if (title !== null) {
            response.destroy();
            return title;
        }
    }
    response.destroy();
    return null;
}

export enum TitleDownloaderRejectionCodesEnum
{
    INVALID_STATUS_CODE = 'INVALID_STATUS_CODE',
    TIMEOUT = 'TIMEOUT',
    NOT_FOUND = 'NOT_FOUND',
    FAILED = 'FAILED', // all other errors
}

export class TitleDownloader
{
    public async downloadTitle(url: string): Promise<string|null>
    {
        const isHttps = url.search(/https/i) === 0;
        const method = isHttps ? httpsGet : httpGet;

        return new Promise((resolve, reject) => {
            const clientRequest = method(
                url,
                async (response) => {
                    if (!(response.statusCode >= 200 && response.statusCode < 300)) {
                        clientRequest.abort();
                        reject(response.statusCode === 404
                            ? TitleDownloaderRejectionCodesEnum.NOT_FOUND
                            : TitleDownloaderRejectionCodesEnum.INVALID_STATUS_CODE
                        );
                        return;
                    }
                    const title = await parseFromStream(response);
                    resolve(title);
                }
            );
            clientRequest.setTimeout(2000, () => {
                clientRequest.abort();
                reject(TitleDownloaderRejectionCodesEnum.TIMEOUT);
            })
            .on('error', (err: any) => {
                if (err && err.message && err.message.indexOf('ENOTFOUND') !== -1) {
                    reject(TitleDownloaderRejectionCodesEnum.NOT_FOUND);
                    return;
                }
                reject(TitleDownloaderRejectionCodesEnum.FAILED);
            });
        });
    }
}

export const titleDownloader = new TitleDownloader();

Does a Node.js worker never time out?

I have an iteration that can take up to hours to complete.
Example:
do {
    // this is an api action
    let response = await fetch_some_data;
    // other database action
    await perform_operation();
    next = response.next;
} while (next);
I am assuming that the operation doesn't time out, but I don't know that for sure.
Any explanation of how Node.js behaves here is highly appreciated. Thanks.
Update:
The actual development code is as follows:
const Shopify = require('shopify-api-node');
const shopServices = require('../../../../services/shop_services/shop');
const { create } = require('../../../../controllers/products/Products');

exports.initiate = async (redis_client) => {
    redis_client.lpop(['sync'], async function (err, reply) {
        if (reply === null) {
            console.log("Queue Empty");
            return true;
        }
        let data = JSON.parse(reply),
            shopservices = new shopServices(data),
            shop_data = await shopservices.get()
                .catch(error => {
                    console.log(error);
                });
        const shopify = new Shopify({
            shopName: shop_data.name,
            accessToken: shop_data.access_token,
            apiVersion: '2020-04',
            autoLimit: false,
            timeout: 60 * 1000
        });
        let params = { limit: 250 };
        do {
            try {
                let response = await shopify.product.list(params);
                if (await create(response, shop_data)) {
                    console.log(`${data.current}`);
                }
                data.current += data.offset;
                params = response.nextPageParameters;
            } catch (error) {
                console.log("here");
                console.log(error);
                params = false;
            }
        } while (params);
    });
}
Everything is working fine till now. I am just making sure that the execution will always run to completion in Node. This function is called by a cron every minute, and the data for processing is provided by a queue.

Capture WebRTC stream

I got this little proof-of-concept script that I copy/paste into the Google Chrome console to capture live webcam video. I capture the chunks every 5 seconds, turn them into blobs, attach them to a FormData instance and post them to a Node server. Then I clean up. It works, but eventually the browser crashes; RAM and CPU spike heavily.
It seems the problematic areas are creating the Blob and FormData variables.
How can I improve the script?
To test, go here:
https://www.earthcam.com/usa/arizona/sedona/redrock/?cam=sedona_hd
Copy/paste the script. Check the tab's RAM and CPU consumption.
let chunks = [];

const getOptions = function() {
    let options = { mimeType: 'video/webm;codecs=vp9,opus' };
    if (!window.MediaRecorder.isTypeSupported(options.mimeType)) {
        console.error(`${options.mimeType} is not supported`);
        options = { mimeType: 'video/webm;codecs=vp8,opus' };
        if (!window.MediaRecorder.isTypeSupported(options.mimeType)) {
            console.error(`${options.mimeType} is not supported`);
            options = { mimeType: 'video/webm' };
            if (!window.MediaRecorder.isTypeSupported(options.mimeType)) {
                console.error(`${options.mimeType} is not supported`);
                options = { mimeType: '' };
            }
        }
    }
    return options;
};

const captureStream = async function(chunks) {
    let blob = new window.Blob(chunks, {
        type: 'video/webm',
    });
    let formData = new window.FormData();
    formData.append('upl', blob, 'myFile.webm');
    await window.fetch('http://localhost:3000', {
        method: 'post',
        body: formData,
    });
    blob = null;
    formData = null;
    console.log(`Saved ${chunks.length}`);
    chunks = [];
};

const recordStream = function() {
    if (window.MediaRecorder === undefined) {
        return console.log('Not supported');
    }
    const video = document.querySelector('video');
    const stream = video.captureStream();
    const options = getOptions();
    const mediaRecorder = new window.MediaRecorder(stream, options);
    mediaRecorder.ondataavailable = function(e) {
        if (e.data && e.data.size > 0) {
            chunks.push(e.data);
        }
    };
    mediaRecorder.start(0);
    // Capture chunks every 5 sec
    setInterval(async function() {
        await captureStream(chunks);
    }, 5000);
};

recordStream();
When I paste the code above into the console it displays this error:
Uncaught SyntaxError: Unexpected token '}'
Adding a preceding { then returns this error:
VM97:3 Uncaught ReferenceError: formData is not defined at <anonymous>:3:11

Node.js long process ran twice

I have a Node.js restful API built in express.js framework. It is usually hosted by pm2.
One of the services has a very long process. When the front end called the service, the process started up. Since there is an error in the database, the process wouldn't finish properly and the error would be caught. However, before the process reached the error, another exactly identical process started up with the same parameters. So in the meantime, two processes were both running, one ahead of the other. After a long time, the first process reached the error point and returned an error; then the second one returned exactly the same thing.
I checked the front end's Network tab and noticed there was actually only one request sent. Where did the second request come from?
Edit 1:
The whole process is: first process sends query to db -> long time wait -> second process starts up -> second process sends query to db -> long time wait -> first process receives db response -> long time wait -> second process receives db response
Edit 2:
The code of the service is as follow:
import { Express, Request, Response } from "express";
import * as multer from "multer";
import * as fs from "fs";
import { Readable, Duplex } from "stream";
import * as uid from "uid";
import { Client } from "pg";
import * as gdal from "gdal";
import * as csv from "csv";
import { SuccessPayload, ErrorPayload } from "../helpers/response";
import { postgresQuery } from "../helpers/database";
import Config from "../config";
export default class ShapefileRoute {
constructor(app: Express) {
// Upload a shapefile
/**
* #swagger
* /shapefile:
* post:
* description: Returns the homepage
* responses:
* 200:
*/
app.post("/shapefile", (req: Request, res: Response, next: Function): void => {
// Create instance of multer
const multerInstance = multer().array("files");
multerInstance(req, res, (err: Error) => {
if (err) {
let payload: ErrorPayload = {
code: 4004,
errorMessage: "Multer upload file error.",
errorDetail: err.message,
hints: "Check error detail"
};
req.reservePayload = payload;
next();
return;
}
// Extract files
let files: any = req.files;
// Extract body
let body: any = JSON.parse(req.body.filesInfo);
// Other params
let writeFilePromises: Promise<any>[] = [];
let copyFilePromises: Promise<any>[] = [];
let rootDirectory: string = Config.uploadRoot;
let outputId: string = uid(4);
// Reset index of those files
let namesIndex: string[] = [];
files.forEach((item: Express.Multer.File, index: number) => {
if(item.originalname.split(".")[1] === "csv" || item.originalname.split(".")[1] === "txt" || item.originalname.split(".")[1] === "shp") {
namesIndex.push(item.originalname);
}
})
// Process and write all files to disk
files.forEach((item: Express.Multer.File, outterIndex: number) => {
if(item.originalname.split(".")[1] === "csv" || item.originalname.split(".")[1] === "txt") {
namesIndex.forEach((indexItem, index) => {
if(indexItem === item.originalname) {
ShapefileRoute.csv(item, index, writeFilePromises, body, rootDirectory, outputId,);
}
})
} else if (item.originalname.split(".")[1] === "shp") {
namesIndex.forEach((indexItem, index) => {
if(indexItem === item.originalname) {
ShapefileRoute.shp(item, index, writeFilePromises, body, rootDirectory, outputId,);
}
})
} else {
ShapefileRoute.shp(item, outterIndex, writeFilePromises, body, rootDirectory, outputId,);
}
})
// Copy files from disk to database
ShapefileRoute.copyFiles(req, res, next, writeFilePromises, copyFilePromises, req.reserveSuperPg, () => {
ShapefileRoute.loadFiles(req, res, next, copyFilePromises, body, outputId)
});
})
});
}
// Process csv file
static csv(file: Express.Multer.File, index: number, writeFilePromises: Promise<any>[], body: any, rootDirectory: string, outputId: string) {
// Streaming file to pivotcsv
writeFilePromises.push(new Promise((resolve, reject) => {
// Get specification from body
let delimiter: string;
let spec: any;
let lrsColumns: string[] = [null, null, null, null, null, null];
body.layers.forEach((jsonItem, i) => {
if (jsonItem.name === file.originalname.split(".")[0]) {
delimiter = jsonItem.file_spec.delimiter;
spec = jsonItem
jsonItem.lrs_cols.forEach((lrsCol) => {
switch(lrsCol.lrs_type){
case "rec_id":
lrsColumns[0] = lrsCol.name;
break;
case "route_id":
lrsColumns[1] = lrsCol.name;
break;
case "f_meas":
lrsColumns[2] = lrsCol.name;
break;
case "t_meas":
lrsColumns[3] = lrsCol.name;
break;
case "b_date":
lrsColumns[4] = lrsCol.name;
break;
case "e_date":
lrsColumns[5] = lrsCol.name;
break;
}
})
}
});
// Pivot csv file
ShapefileRoute.pivotCsv(file.buffer, `${rootDirectory}/${outputId}_${index}`, index, delimiter, outputId, lrsColumns, (path) => {
console.log("got pivotCsv result");
spec.order = index;
resolve({
path: path,
spec: spec
});
}, reject);
}));
}
// Process shapefile
static shp(file: Express.Multer.File, index: number, writeFilePromises: Promise<any>[], body: any, rootDirectory: string, outputId: string) {
// Write file to disk and then call shp2csv to generate csv
writeFilePromises.push(new Promise((resolve, reject) => {
// Write shapefile to disk
fs.writeFile(`${rootDirectory}/shps/${file.originalname}`, file.buffer, (err) => {
// If it is .shp file, resolve it's path and spec
if(file.originalname.split(".")[1] === "shp") {
// Find spec of the shapefile from body
body.layers.forEach((jsonItem, i) => {
if (jsonItem.name === file.originalname.split(".")[0]) {
let recordColumn: string = null;
let routeIdColumn: string = null;
jsonItem.lrs_cols.forEach((lrsLayer) => {
if (lrsLayer.lrs_type === "rec_id") {
recordColumn = lrsLayer.name;
}
if (lrsLayer.lrs_type === "route_id") {
routeIdColumn = lrsLayer.name;
}
})
// Transfer shp to csv
ShapefileRoute.shp2csv(`${rootDirectory}/shps/${file.originalname}`, `${rootDirectory}/${outputId}_${index}`, index, outputId, recordColumn, routeIdColumn, (path, srs) => {
// Add coordinate system, geom column and index of this file to spec
jsonItem.file_spec.proj4 = srs;
jsonItem.file_spec.geom_col = "geom";
jsonItem.order = index;
// Return path and spec
resolve({
path: path,
spec: jsonItem
})
}, (err) => {
reject(err);
})
}
});
} else {
resolve(null);
}
})
}));
}
// Copy files to database
static copyFiles(req: Request, res: Response, next: Function, writeFilePromises: Promise<any>[], copyFilePromises: Promise<any>[], client: Client, callback: () => void) {
// Take all files generated by writefile processes
Promise.all(writeFilePromises)
.then((results) => {
// Remove null results. They are from .dbf .shx etc of shapefile.
const files: any = results.filter(arr => arr);
// Create promise array. This will be triggered after all files are written to database.
files.forEach((file) => {
copyFilePromises.push(new Promise((copyResolve, copyReject) => {
let query: string = `copy lbo.lbo_temp from '${file.path}' WITH NULL AS 'null';`;
// Create super user call
postgresQuery(client, query, (data) => {
copyResolve(file.spec);
}, copyReject);
}));
});
// Trigger upload query
callback()
})
.catch((err) => {
// Response as error if any file generating is wrong
let payload: ErrorPayload = {
code: 4004,
errorMessage: "Something wrong when processing csv and/or shapefile.",
errorDetail: err.message,
hints: "Check error detail"
};
req.reservePayload = payload;
next();
})
}
// Load layers in database
static loadFiles(req: Request, res: Response, next: Function, copyFilePromises: Promise<any>[], body: any, outputId: string) {
Promise.all(copyFilePromises)
.then((results) => {
// Resort all results by the order assigned when creating files
results.sort((a, b) => {
return a.order - b.order;
});
results.forEach((result) => {
delete result.order;
});
// Create JSON for load layer database request
let taskJson = body;
taskJson.layers = results;
let query: string = `select lbo.load_layers2(p_session_id := '${outputId}', p_layers := '${JSON.stringify(taskJson)}'::json)`;
postgresQuery(req.reservePg, query, (data) => {
// Get result
let result = data.rows[0].load_layers2.result;
// Return 4003 error if no result
if (!result) {
let payload: ErrorPayload = {
code: 4003,
errorMessage: "Load layers error.",
errorDetail: data.rows[0].load_layers2.error ? data.rows[0].load_layers2.error.message : "Load layers returns no result.",
hints: "Check error detail"
};
req.reservePayload = payload;
next();
return;
}
let payload: SuccessPayload = {
type: "string",
content: "Upload files done."
};
req.reservePayload = payload;
next();
}, (err) => {
req.reservePayload = err;
next();
});
})
.catch((err) => {
// Response as error if any file generating is wrong
let payload: ErrorPayload = {
code: 4004,
errorMessage: "Something wrong when copy files to database.",
errorDetail: err,
hints: "Check error detail"
};
req.reservePayload = payload;
next();
})
}
// Pivot csv process. Write output csv to disk and return path of the file.
static pivotCsv(buffer: Buffer, outputPath: string, inputIndex: number, delimiter: string, outputId: string, lrsColumns: string[], callback: (path: string) => void, errCallback: (err: Error) => void) {
let inputStream: Duplex = new Duplex();
// Define output stream
let output = fs.createWriteStream(outputPath, {flags: "a"});
// Callback when output stream is done
output.on("finish", () => {
console.log("output stream finish");
callback(outputPath);
});
// Define parser stream
let parser = csv.parse({
delimiter: delimiter
});
// Close output stream when parser stream is end
parser.on("end", () => {
console.log("parser stream end");
output.end();
});
// Write data when a chunk is parsed
let header = [null, null, null, null, null, null];
let attributesHeader = [];
let i = 0;
let datumIndex: boolean = true;
parser.on("data", (chunk) => {
console.log("parser received chunk: ", i);
if (datumIndex) {
chunk.forEach((datum, index) => {
if (lrsColumns.includes(datum)) {
header[lrsColumns.indexOf(datum)] = index;
} else {
attributesHeader.push({
name: datum,
index: index
})
}
});
datumIndex = false;
} else {
i ++;
// let layer_id = ;
let rec_id = header[0] ? chunk[header[0]] : i;
let route_id = header[1] ? chunk[header[1]] : null;
let f_meas = header[2] ? chunk[header[2]] : null;
let t_meas = header[3] ? chunk[header[3]] : null;
let b_date = header[4] ? chunk[header[4]] : null;
let e_date = header[5] ? chunk[header[5]] : null;
let attributes = {};
attributesHeader.forEach((attribute) => {
attributes[attribute.name] = chunk[attribute.index];
});
let attributesOrdered = {};
Object.keys(attributes).sort().forEach((key) => {
attributesOrdered[key] = attributes[key];
});
let outputData = `${outputId}\t${inputIndex}\t${rec_id}\t${route_id}\tnull\t${f_meas}\t${t_meas}\t${b_date}\t${e_date}\tnull\t${JSON.stringify(attributesOrdered)}\n`;
output.write(outputData);
}
});
inputStream.push(buffer);
inputStream.push(null);
inputStream.pipe(parser);
}
// Write shp and transfer to database format. Return file path and projection.
static shp2csv(inputPath: string, outputPath: string, i: number, ouputId: string, recordColumn: string, routeIdColumn: string, callback: (path: string, prj: string) => void, errCallback: (err: Error) => void) {
let dataset = gdal.open(inputPath);
let layercount = dataset.layers.count();
let layer = dataset.layers.get(0);
let output = fs.createWriteStream(outputPath, {flags: "a"});
output.on("finish", () => {
callback(outputPath, layer.srs.toProj4());
});
layer.features.forEach((feature, featureId) => {
let geom;
let recordId: number = null;
let routeId: string = null;
try {
let geomWKB = feature.getGeometry().toWKB();
let geomWKBString = geomWKB.toString("hex");
geom = geomWKBString;
if (recordColumn) {
recordId = feature.fields.get(recordColumn);
}
if (routeIdColumn) {
routeId = feature.fields.get(routeIdColumn);
}
}
catch (err) {
console.log(err);
}
let attributes = {};
let attributesOrdered = {};
feature.fields.forEach((value, field) => {
if (field != recordColumn && field != routeIdColumn) {
attributes[field] = value;
}
});
Object.keys(attributes).sort().forEach((key) => {
attributesOrdered[key] = attributes[key];
});
output.write(`${ouputId}\t${i.toString()}\t${recordId ? recordId : (featureId + 1).toString()}\t${routeId}\tnull\tnull\tnull\tnull\tnull\t${geom}\t${JSON.stringify(attributesOrdered)}\n`);
});
output.end();
}
}
The browser retries some requests if the server doesn't send a response and the browser hits its timeout value. Each browser may be configured with its own timeout, but 2 minutes sounds like it's probably the browser timeout.
You can't control the browser's timeout from your server. Two minutes is just too long to ask it to wait. You need a different design that responds sooner and then communicates back the eventual result later when it's ready. Either client polling or server push with webSocket/socket.io.
For client polling, you could have the server respond immediately to the first request and return a token (some unique string). Then, the client can ask the server for the response for that token every minute until the server eventually has the response. If the server doesn't yet have the response, it immediately returns a code that means "no response yet". In that case, the client sets a timer and tries again in a minute, sending the token each time so the server knows which request it is asking about.
For server push, the client creates a persistent webSocket or socket.io connection to the server. When the client makes its long-running request, the server just immediately returns the same type of token described above. Then, when the server is done with the request, it sends the token and the final data over the socket.io connection. The client listens for incoming messages on that socket.io connection and receives the final response there.
