Node.js chronological issue - node.js

I have a problem with node.js. The commands in my program don't run chronologically and I don't know how to make them do so.
I'm trying to load some images and text from a database and send them in packs of 8. But node.js seems to run the for loop and the command after the loop at the same time.
Here's my code:
socket.on('background_dinamically', function(data){
connection.query("SELECT * FROM products WHERE id='"+data.cathegory+"'" , function(err, rows, fields){
var count = 0;
var array_elements = [];
if(err){
socket.emit('errorserver');
}else{
for (var i = rows.length - 1, count; i >= 0; i-- & count ++) {
array_elements.push(rows[i]);
if (count == 8) {
socket.emit('image_loading_background', [array_elements, data]);
count = 0;
array_elements = [];
}
};
if(count > 0 && count < 8 && count != 0) {
socket.emit('image_loading_background', [array_elements, data]);
}
}
});
});

Marc, first I would check if synchronisation can be done on the client side. If you force your nodejs app to synchronize before sending data to the client, scalability suffers.
If you cannot do without synchronizing on the server side, you can choose between spaghetti code or a sync lib.

Welcome to the world of asynchronous (not chronological) programming. By default, node will work on I/O operations in parallel as you are seeing. To get other behaviors including chronological (in serial), parallel batches, as well as error handling helpers, have a look at one of the many flow control libraries available. Specifically, I recommend caolan/async.
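As far as I can tell, the loop in the question increments count only after the comparison, so the first pack ends up with nine rows instead of eight. A minimal sketch of the batching as a separate helper, assuming the rows are already in memory; `chunk` and `emitInPacks` are hypothetical names, and `emit` stands in for socket.emit:

```javascript
// Split an array into consecutive packs of at most `size` items.
// `chunk` is a hypothetical helper, not part of socket.io or any database driver.
function chunk(items, size) {
  const packs = [];
  for (let start = 0; start < items.length; start += size) {
    packs.push(items.slice(start, start + size));
  }
  return packs;
}

// Emit every pack in order; `emit` stands in for socket.emit.
function emitInPacks(rows, size, emit) {
  chunk(rows, size).forEach(function (pack) {
    emit('image_loading_background', pack);
  });
}

// Usage with 19 fake rows and a fake emitter that records pack sizes.
const sent = [];
const rows = Array.from({ length: 19 }, function (_, i) { return i; });
emitInPacks(rows, 8, function (event, pack) { sent.push(pack.length); });
console.log(sent); // [ 8, 8, 3 ]
```

Because the packs are built synchronously inside the query callback, they are emitted in order and the remainder needs no special case.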

Related

node-maxmind usage for streaming data with IP address

I have streaming data coming in with IP addresses. I want to translate each IP to longitude and latitude before putting the data into my database.
This is what I was doing, but it is causing some issues. I also tried putting locationObject outside the for loop. Weirdly, that uses a lot of memory. I know this is blocking code, but it should be fast. I do see a memory issue, though, as the data object comes from a stream continuously and each data object is huge.
for (var i = 0; i < data.length; i++){
if (data.client_ip !== null) {
var locationLookup = maxmind.openSync('./GeoIP2-City.mmdb');
var ip = data.client_ip;
var maxmindObj = locationLookup.get(ip);
locationObject.country = maxmindObj.country.names.en;
locationObject.latitude = maxmindObj.location.latitude;
locationObject.longitude = maxmindObj.location.longitude;
}
}
Again, putting maxmind.openSync('./GeoIP2-City.mmdb'); outside the for loop causes a huge increase in memory.
The other option is to use non-blocking code:
maxmind.open('/path/to/GeoLite2-City.mmdb', (err, cityLookup) => {
var city = cityLookup.get('66.6.44.4');
});
But I don't think it is a good idea to put this inside a loop.
How can I handle this? I am getting a data object every minute.
https://github.com/runk/node-maxmind
I'm not sure why you think reading the database file on each iteration of the loop would be fast ("blocking code" doesn't equal "fast code"). It's much better to read the database file once and then loop over the data.
maxmind.openSync() will read the entire database into memory, which is mentioned in the README:
Be careful with sync version! Since mmdb files are quite large
(city database is about 100Mb) fs.readFileSync blocks whole
process while it reads file into buffer.
If you don't have memory to spare, the only other option would be to open the file asynchronously. Again, not inside the loop, but outside of it:
maxmind.open("./GeoIP2-City.mmdb", (err, locationLookup) => {
  if (err) throw err; // the database could not be opened
  for (var i = 0; i < data.length; i++) {
    // assumes data is an array of records, each with a client_ip field
    if (data[i].client_ip !== null) {
      var ip = data[i].client_ip;
      var maxmindObj = locationLookup.get(ip);
      locationObject.country = maxmindObj.country.names.en;
      locationObject.latitude = maxmindObj.location.latitude;
      locationObject.longitude = maxmindObj.location.longitude;
    }
  }
});
The only thing I am worried about is that over time I call this function many times: once every time my consumers read a jsonObject from Kafka, which happens every minute. Is there a better way to optimize that as well?
function processData(jsonObject) {
maxmind.open('./GeoIP2-City.mmdb', function(err, locationLookup) {
if (err) {
logger.error('something went wrong on maxmind fetch', err);
}
for (var i = 0; i < jsonObject.length; i++) { ...}
})
}
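One way to avoid reopening the database on every call is to cache the opened lookup and reuse it. The sketch below is a minimal illustration under assumptions: `openOnce` and `opener` are hypothetical names, `opener` is any function that returns a Promise (for example, the callback-style maxmind.open call above wrapped in one), and the fake opener here only counts how often it runs.

```javascript
// Cache the result of an expensive asynchronous open so it runs only once.
// `openOnce` and `opener` are hypothetical names; `opener` must return a Promise.
function openOnce(opener) {
  let cached = null;
  return function getLookup() {
    if (cached === null) {
      cached = opener().catch(function (err) {
        cached = null; // allow a retry after a failed open
        throw err;
      });
    }
    return cached;
  };
}

// Stand-in for the real database open; it counts how often it is called.
let opens = 0;
const getLookup = openOnce(function fakeOpen() {
  opens += 1;
  return Promise.resolve({ get: function (ip) { return { ip: ip }; } });
});

// Called once per incoming batch; only the first call triggers the open.
function processData(records) {
  return getLookup().then(function (lookup) {
    return records.map(function (record) { return lookup.get(record.client_ip); });
  });
}
```

With this shape, processData can run every minute while the file is read only on the first call.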

Inconsistent request behavior in Node when requesting large number of links?

I am currently using this piece of code to connect to a massive list of links (a total of 2458 links, dumped at https://pastebin.com/2wC8hwad) to get feeds from numerous sources, and to deliver them to users of my program.
It basically splits one massive array into multiple batches (arrays), then forks a process that handles a batch by requesting each stored link and checking for a 200 status code. Only when a batch is complete is the next batch sent for processing, and when it's all done the forked process is disconnected. However, I'm seeing apparent inconsistency in how this logic performs, particularly in the part where it requests the status code.
const req = require('./request.js')
const process = require('child_process')
const linkList = require('./links.json')
let processor
console.log(`Total length: ${linkList.length}`) // 2458 links
const batchLength = 400
const batchList = [] // Contains batches (arrays) of links
let currentBatch = []
for (var i in linkList) {
if (currentBatch.length < batchLength) currentBatch.push(linkList[i])
else {
batchList.push(currentBatch)
currentBatch = []
currentBatch.push(linkList[i])
}
}
if (currentBatch.length > 0) batchList.push(currentBatch)
console.log(`Batch list length by default is ${batchList.length}`)
// cutDownBatchList(1)
console.log(`New batch list length is ${batchList.length}`)
const startTime = new Date()
getBatchIsolated(0, batchList)
let failCount = 0
function getBatchIsolated (batchNumber) {
console.log('Starting batch #' + batchNumber)
let completedLinks = 0
const currentBatch = batchList[batchNumber]
if (!processor) processor = process.fork('./request.js')
for (var u in currentBatch) { processor.send(currentBatch[u]) }
processor.on('message', function (linkCompletion) {
if (linkCompletion === 'failed') failCount++
if (++completedLinks === currentBatch.length) {
if (batchNumber !== batchList.length - 1) setTimeout(getBatchIsolated, 500, batchNumber + 1)
else finish()
}
})
}
function finish() {
console.log(`Completed, time taken: ${((new Date() - startTime) / 1000).toFixed(2)}s. (${failCount}/${linkList.length} failed)`)
processor.disconnect()
}
function cutDownBatchList(maxBatches) {
for (var r = batchList.length - 1; batchList.length > maxBatches && r >= 0; r--) {
batchList.splice(r, 1)
}
return batchList
}
Below is request.js, using needle. (However, for some strange reason it may completely hang up on a particular site indefinitely - in that case, I just use this workaround)
const needle = require('needle')
function connect (link, callback) {
const options = {
timeout: 10000,
read_timeout: 8000,
follow_max: 5,
rejectUnauthorized: true
}
const request = needle.get(link, options)
.on('header', (statusCode, headers) => {
if (statusCode === 200) callback(null, link)
else request.emit('err', new Error(`Bad status code (${statusCode})`))
})
.on('err', err => callback(err, link))
}
process.on('message', function(linkRequest) {
connect(linkRequest, function(err, link) {
if (err) {
console.log(`Couldn't connect to ${link} (${err})`)
process.send('failed')
} else process.send('success')
})
})
In theory, I think this should perform perfectly fine: it spawns off a separate process to handle the dirty work in sequential batches, so it's not overloaded and is super scalable. However, when using the full list of 2458 links with a total of 7 batches, I often get massive "socket hang up" errors on random batches on almost every trial that I do, similar to what would happen if I requested all the links at once.
If I cut down the number of batches to 1 using the function cutDownBatchList it performs perfectly fine on almost every trial. This is all happening on a Linux Debian VPS with two 3.1GHz vCores and 4 GB RAM from OVH, on Node v6.11.2
One thing I also noticed is that if I increased the timeout to 30000 (30 sec) in request.js for 7 batches, it works as intended - however it works perfectly fine with a much lower timeout when I cut it down to 1 batch. If I also try to do all 2458 links at once, with a higher timeout, I also face no issues (which basically makes this mini algorithm useless if I can't cut down the timeout via batch handling links). This all goes back to the inconsistent behavior issue.
The best TLDR I can do: Trying to request a bunch of links in sequential batches in a forked child process succeeds almost every time with a lower number of batches, but fails consistently with the full number of batches, even though the behavior should be the same since it's handling them in isolated batches.
Any help would be greatly appreciated in solving this issue as I just cannot for the life of me figure it out!
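One detail worth checking in the code above: getBatchIsolated calls processor.on('message', ...) again for every batch, and listeners registered with on() stay attached. The sketch below uses a plain EventEmitter as a stand-in for the forked process to show the effect on a shared counter. It is an observation about the counting only, not a confirmed explanation for the socket hang ups.

```javascript
const EventEmitter = require('events');

// `worker` stands in for the forked child; any EventEmitter behaves the same way.
const worker = new EventEmitter();
let failCount = 0;

function startBatch() {
  // Registered again on every call, as in getBatchIsolated above.
  worker.on('message', function (status) {
    if (status === 'failed') failCount++;
  });
}

startBatch();
startBatch();
startBatch();
worker.emit('message', 'failed'); // a single failure reported by the child

console.log(worker.listenerCount('message')); // 3
console.log(failCount); // 3, although only one link failed
```

Registering the handler once outside the batch function, or removing it with removeListener when a batch completes, keeps one message from being counted several times.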

Best way to execute parallel processing in Node.js

I'm trying to write a small node application that will search through and parse a large number of files on the file system.
In order to speed up the search, we are attempting to use some sort of map reduce. The plan would be the following simplified scenario:
Web request comes in with a search query
3 processes are started that each get assigned 1000 (different) files
once a process completes, it would 'return' its results back to the main thread
once all processes complete, the main thread would continue by returning the combined result as a JSON result
The questions I have with this are:
Is this doable in Node?
What is the recommended way of doing it?
I've been fiddling, but have come no further than the following example using Process:
initiator:
function Worker() {
return child_process.fork("myProcess.js");
}
for(var i = 0; i < require('os').cpus().length; i++){
var process = new Worker();
process.send(workItems.slice(i * itemsPerProcess, (i+1) * itemsPerProcess));
}
myProcess.js
process.on('message', function(msg) {
var valuesToReturn = [];
// Do file reading here
//How would I return valuesToReturn?
process.exit(0);
});
Few sidenotes:
I'm aware the number of processes should depend on the number of CPUs on the server
I'm also aware of speed restrictions in a file system. Consider it a proof of concept before we move this to a database or Lucene instance :-)
Should be doable. As a simple example:
// parent.js
var child_process = require('child_process');
var numchild = require('os').cpus().length;
var done = 0;
for (var i = 0; i < numchild; i++){
var child = child_process.fork('./child');
child.send((i + 1) * 1000);
child.on('message', function(message) {
console.log('[parent] received message from child:', message);
done++;
if (done === numchild) {
console.log('[parent] received all results');
...
}
});
}
// child.js
process.on('message', function(message) {
console.log('[child] received message from server:', message);
setTimeout(function() {
process.send({
child : process.pid,
result : message + 1
});
process.disconnect();
}, (0.5 + Math.random()) * 5000);
});
So the parent process spawns an X number of child processes and passes them a message. It also installs an event handler to listen for any messages sent back from the child (with the result, for instance).
The child process waits for messages from the parent, and starts processing (in this case, it just starts a timer with a random timeout to simulate some work being done). Once it's done, it sends the result back to the parent process and uses process.disconnect() to disconnect itself from the parent (basically stopping the child process).
The parent process keeps track of the number of child processes started, and the number of them that have sent back a result. When those numbers are equal, the parent received all results from the child processes so it can combine all results and return the JSON result.
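The example above sends each child a number rather than a real share of the work. A minimal sketch of how the file list could be divided among the children and the results merged afterwards; `partition` and `combine` are hypothetical helper names, and the letters stand in for file paths:

```javascript
// Split `items` into `parts` contiguous slices whose sizes differ by at most one.
// `partition` is a hypothetical helper; each slice would be passed to child.send().
function partition(items, parts) {
  const slices = [];
  const base = Math.floor(items.length / parts);
  const extra = items.length % parts;
  let start = 0;
  for (let p = 0; p < parts; p++) {
    const size = base + (p < extra ? 1 : 0);
    slices.push(items.slice(start, start + size));
    start += size;
  }
  return slices;
}

// Merge the per-child result arrays once every child has reported back.
function combine(results) {
  return results.reduce(function (all, part) { return all.concat(part); }, []);
}

const slices = partition(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 3);
console.log(slices); // [ [ 'a', 'b', 'c' ], [ 'd', 'e' ], [ 'f', 'g' ] ]
console.log(combine(slices).length); // 7
```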
For a distributed problem like this, I've used zmq and it has worked really well. I'll describe a similar problem that I ran into and first attempted to solve with child processes (but failed) before turning to zmq.
Using bcrypt, or any expensive hashing algorithm, is wise, but it blocks the node process for around 0.5 seconds. We had to offload this to a different server, and as a quick fix I used essentially what you did: run a child process, send messages to it, and have it respond. The only issue we found was that, for whatever reason, our child process would pin an entire core while doing absolutely no work. (I still haven't figured out why this happened. We ran a trace and it appeared that epoll was failing on the stdout/stdin streams. It also only happened on our Linux boxes and worked fine on OS X.)
edit:
The pinning of the core was fixed in https://github.com/joyent/libuv/commit/12210fe and was related to https://github.com/joyent/node/issues/5504, so if you run into the issue and you're using centos + kernel v2.6.32: update node, or update your kernel!
Regardless of the issues I had with child_process.fork(), here's a nifty pattern I always use
client:
var child_process = require('child_process');
function FileParser() {
this.__callbackById = [];
this.__callbackIdIncrement = 0;
this.__process = child_process.fork('./child');
this.__process.on('message', this.handleMessage.bind(this));
}
FileParser.prototype.handleMessage = function handleMessage(message) {
var error = message.error;
var result = message.result;
var callbackId = message.callbackId;
var callback = this.__callbackById[callbackId];
if (! callback) {
return;
}
callback(error, result);
delete this.__callbackById[callbackId];
};
FileParser.prototype.parse = function parse(data, callback) {
this.__callbackIdIncrement = (this.__callbackIdIncrement + 1) % 10000000;
this.__callbackById[this.__callbackIdIncrement] = callback;
this.__process.send({
data: data, // optionally you could pass in the path of the file, and open it in the child process.
callbackId: this.__callbackIdIncrement
});
};
module.exports = FileParser;
child process:
process.on('message', function(message) {
var callbackId = message.callbackId;
var data = message.data;
function respond(error, response) {
process.send({
callbackId: callbackId,
error: error,
result: response
});
}
// parse data..
respond(undefined, "computed data");
});
We also need a pattern to synchronize the different processes. When each process finishes its task, it responds to us and we increment a count; once the count reaches the number we are waiting for, the Semaphore calls its callback.
function Semaphore(wait, callback) {
this.callback = callback;
this.wait = wait;
this.counted = 0;
}
Semaphore.prototype.signal = function signal() {
this.counted++;
if (this.counted >= this.wait) {
this.callback();
}
}
module.exports = Semaphore;
here's a use case that ties all the above patterns together:
var FileParser = require('./FileParser');
var Semaphore = require('./Semaphore');
var arrFileParsers = [];
for(var i = 0; i < require('os').cpus().length; i++){
var fileParser = new FileParser();
arrFileParsers.push(fileParser);
}
function getFiles() {
return ["file", "file"];
}
var arrResults = [];
function onAllFilesParsed() {
console.log('all results completed', JSON.stringify(arrResults));
}
var lock = new Semaphore(arrFileParsers.length, onAllFilesParsed);
arrFileParsers.forEach(function(fileParser) {
var arrFiles = getFiles(); // you need to decide how to split the files into 1k chunks
fileParser.parse(arrFiles, function (error, result) {
arrResults.push(result);
lock.signal();
});
});
Eventually I used http://zguide.zeromq.org/page:all#The-Load-Balancing-Pattern, where the client was using the nodejs zmq client, and the workers/broker were written in C. This allowed us to scale this across multiple machines, instead of just a local machine with sub processes.

fs.readFileSync seems faster than fs.readFile - is it OK to use for a web app in production?

I know that when developing in node, you should always try to avoid blocking (sync) functions and go with async functions, however, I did a little test to see how they compare.
I need to open a JSON file that contains i18n data (like date and time formats, etc) and pass that data to a class that uses this data to format numbers, etc in my view.
It would be kind of awkward to start wrapping all the class's methods inside callbacks, so if possible, I would use the synchronous version instead.
console.time('one');
console.time('two');
fs.readFile( this.dir + "/" + locale + ".json", function (err, data) {
if (err) cb( err );
console.timeEnd('one');
});
var data = fs.readFileSync( this.dir + "/" + locale + ".json" );
console.timeEnd('two');
This results in the following lines in my console:
two: 1ms
one: 159ms
It seems that fs.readFileSync is about 150 times faster than fs.readFile - it takes about 1 ms to load a 50KB JSON file (minified). All my JSON files are around 50-100KB.
I was also thinking of somehow memoizing this JSON data or saving it to the session, so that the file is read only once per session (or when the user changes their locale). I'm not entirely sure how to do that; it's just an idea.
Is it okay to use fs.readFileSync in my case or will I get in trouble later?
No, it is not OK to use a blocking API call in a node server as you describe. Your site's responsiveness to many concurrent connections will take a huge hit. It's also just blatantly violating the #1 principle of node.
The key to node working is that while it is waiting on IO, it is doing CPU/memory processing at the same time. This requires asynchronous calls exclusively. So if you have 100 clients reading 100 JSON files, node can ask the OS to read those 100 files but while waiting for the OS to return the file data when it is available, node can be processing other aspects of those 100 network requests. If you have a single synchronous call, ALL of your client processing stops entirely while that operation completes. So client number 100's connection waits with no processing whatsoever while you read files for clients 1, 2, 3, 4, and so on sequentially. This is Failville.
Here's another analogy. If you went to a restaurant and were the only customer, you would probably get faster service if a single person sat you, took your order, cooked it, served it to you, and handled the bill without the coordination overhead of dealing with the host/hostess, server, head chef, line cooks, cashiers, etc. However, with 100 customers in the restaurant, the extra coordination means things happen in parallel and the overall responsiveness of the restaurant is increased way beyond what it would be if a single person were trying to handle 100 customers on their own.
Your synchronous read is blocking the callback of the asynchronous read; remember that node is single-threaded.
I understand that the time difference still looks amazing, but you should try with a file that takes much, much longer to read, and imagine that many, many clients do the same; only then does the overhead pay off.
That should answer your question: yes, you will run into trouble if you are serving thousands of requests with blocking IO.
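The memoizing idea from the question can be combined with the asynchronous API, so each locale file is read at most once and nothing blocks. A minimal sketch, assuming a Node version that provides fs.promises; `createLocaleLoader` and `dir` are hypothetical names:

```javascript
const fs = require('fs');
const path = require('path');

// Load each locale file at most once; later callers reuse the same Promise.
// `createLocaleLoader` and `dir` are hypothetical names for illustration.
function createLocaleLoader(dir) {
  const cache = new Map();
  return function loadLocale(locale) {
    if (!cache.has(locale)) {
      const file = path.join(dir, locale + '.json');
      // Note: a failed read stays cached in this sketch.
      cache.set(locale, fs.promises.readFile(file, 'utf8').then(JSON.parse));
    }
    return cache.get(locale);
  };
}
```

A request handler would then call loadLocale(locale).then(...) and construct the formatting class inside the callback, so only the first request for a locale touches the disk.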
After a lot of time and a lot of learning and practice, I tried once more, found the answer, and can show an example:
const fs = require('fs');
const syncTest = () => {
let startTime = +new Date();
const results = [];
const files = [];
for (let i=0, len=4; i<len; i++) {
files.push(fs.readFileSync(`file-${i}.txt`));
};
for (let i=0, len=360; i<len; i++) results.push(Math.sin(i), Math.cos(i));
console.log(`Sync version: ${+new Date() - startTime}`);
};
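const asyncTest = () => {
let startTime = +new Date();
const results = [];
const files = [];
for (let i=0, len=4; i<len; i++) {
// the callback receives (err, data); the first argument is the error
fs.readFile(`file-${i}.txt`, (err, file) => { if (err) throw err; files.push(file); });
};
for (let i=0, len=360; i<len; i++) results.push(Math.sin(i), Math.cos(i));
// logged before the reads complete, so this measures only the time to start them
console.log(`Async version: ${+new Date() - startTime}`);
};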
syncTest();
asyncTest();
Yes, using the asynchronous API is the right approach in a server-side environment. But the use case may be different, such as generating a build for a client-side JS project while reading and writing JSON files for different flavors.
There, blocking doesn't matter that much, and we needed a quick way to create a minified build for deployment (this is where the synchronous calls come into the picture).
I tried to measure the real difference in speed between fs.readFileSync() and fs.readFile() when reading 3 different files from an SD card, with some math calculation added between the reads. I don't see the speed difference that the usual Node diagrams show, where Node is faster even for simple operations and reading the same file 3 times takes about as long as reading it once.
I understand that it is undoubtedly useful for a server to be able to do other work while a file is being read, but the diagrams often shown on YouTube or in books are not precise: in a situation like the one below, async Node is slower than sync at reading small files (here 85 kB, 170 kB, 255 kB).
var fs = require('fs');
var startMeasureTime = () => {
var start = new Date().getTime();
return start;
};
// synch version
console.log('Start');
var start = startMeasureTime();
for (var i = 1; i<=3; i++) {
var fileName = `Lorem-${i}.txt`;
var fileContents = fs.readFileSync(fileName);
console.log(`File ${1} was downloaded(${fileContents.length/1000}KB) after ${new Date().getTime() - start}ms from start.`);
if (i === 1) {
var hardMath = 3*54/25*35/46*255/34/9*54/25*35/46*255/34/9*54/25*35/46*255/34/9*54/25*35/46*255/34/9*54/25*35/46*255/34/9;
};
};
// asynch version
setImmediate(() => {
console.log('Start');
var start = startMeasureTime();
for (var i = 1; i<=3; i++) {
var fileName = `Lorem-${i}.txt`;
fs.readFile(fileName, {encoding: 'utf8'}, (err, fileContents) => {
console.log(`File ${1} was downloaded(${fileContents.length/1000}KB) after ${new Date().getTime() - start}ms from start.`);
});
if (i === 1) {
var hardMath = 3*54/25*35/46*255/34/9*54/25*35/46*255/34/9*54/25*35/46*255/34/9*54/25*35/46*255/34/9*54/25*35/46*255/34/9;
};
};
});
This is from console:
Start
File 1 was downloaded(255.024KB) after 2ms from start.
File 1 was downloaded(170.016KB) after 5ms from start.
File 1 was downloaded(85.008KB) after 6ms from start.
Start
File 1 was downloaded(255.024KB) after 10ms from start.
File 1 was downloaded(85.008KB) after 11ms from start.
File 1 was downloaded(170.016KB) after 12ms from start.

Can I allow for "breaks" in a for loop with node.js?

I have a massive for loop and I want to allow I/O to continue while I'm processing. Maybe every 10,000 or so iterations. Any way for me to allow for additional I/O this way?
A massive for loop just blocks the entire server.
You have two options: either put the for loop in a new thread, or make it asynchronous.
var data = [];
var next = function(i) {
  if (i >= data.length) return; // stop at the end of the data
  // do thing with data[i];
  // setImmediate lets pending I/O run first; a recursive process.nextTick would starve it
  setImmediate(next, i + 1);
};
setImmediate(next, 0);
I don't recommend the latter. You're just implementing naive time slicing, which the OS-level process scheduler can do better than you.
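var exec = require("child_process").exec;
// exec() returns the ChildProcess; its stdout property is a stream
var child = exec("node " + filename, function (err, stdout, stderr) {
  // called once the child exits, with the buffered output as strings
});
child.stdout.on("data", function (chunk) {
  // handle data
});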
Alternatively use something like hook.io to manage processes for you.
Actually you probably want to aggressively redesign your codebase if you have a blocking for loop.
Maybe something like this to break your loop into chunks...
Instead of:
for (var i=0; i<len; i++) {
  doSomething(i);
}
Something like:
var i = 0;
function processChunk() {
  // run at most 10,000 iterations, then yield to the event loop
  var limit = Math.min(i + 10000, len);
  for (; i<limit; i++) {
    doSomething(i);
  }
  if (i < len)
    setImmediate(processChunk);
}
processChunk();
The setImmediate() call gives a chance for other events to get in there, but it still does most looping synchronously which (I'm guessing) will be a lot faster than creating a new event for every iteration. (process.nextTick() is not suitable here in current versions of Node, because its callbacks run before the event loop moves on to I/O.) And obviously, you can experiment with the number (10,000) till you get the results you want.
In theory, you could also use setTimeout() rather than setImmediate(), in case it turns out that giving other work a somewhat bigger "time-slice" helps. That gives you one more variable (the timeout milliseconds) that you can use for tuning.
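For callers that need to know when the whole loop has finished, the chunking idea can be wrapped in a Promise. A minimal sketch, assuming setImmediate is used to yield between chunks; `processInChunks` is a hypothetical helper name:

```javascript
// Run fn(0) .. fn(len - 1) in chunks, yielding to the event loop between chunks.
// `processInChunks` is a hypothetical helper name.
function processInChunks(len, chunkSize, fn) {
  return new Promise(function (resolve, reject) {
    let i = 0;
    function runChunk() {
      try {
        const limit = Math.min(i + chunkSize, len);
        for (; i < limit; i++) fn(i);
      } catch (err) {
        return reject(err);
      }
      if (i < len) setImmediate(runChunk); // let other queued callbacks run first
      else resolve();
    }
    runChunk();
  });
}

// Usage: a callback queued before the loop starts runs between the first two chunks.
const events = [];
setImmediate(function () { events.push('other callback'); });
const done = processInChunks(6, 2, function (i) { events.push(i); });
done.then(function () {
  console.log(events); // [ 0, 1, 'other callback', 2, 3, 4, 5 ]
});
```

The first chunk runs synchronously inside the call, so the chunk size still bounds how long other work can be delayed.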

Resources