NodeJS unzipper stream never finishes

NodeJS unzipper stream never finishes - node.js

I am attempting to download a zip file, extract the contents, and push them into a database. Unfortuantely, my stream never seems to complete, so I never get the opportunity to do clean up and end the process.
I have stripped the code down to the minimum to reproduce the error.
let debugmode = false;
fs.createReadStream(zPath)
.pipe(unzip.Parse())
.pipe(Stream.Transform({
objectMode: true,
transform: async function(entry,e,done) {
console.log('Item: ' + debugmode++ + ' of 819080');
let buff = await entry.buffer();
await entry.autodrain().promise()
done();
}
}))
.on('finish',()=>{
console.log('DONE');
})
;
The log shows the last couople of items, but never issues the word DONE.
Item: 819075
Item: 819076
Item: 819077
Item: 819078
Item: 819079
Item: 819080
Is there something I have done incorrectly? Is there something I can do to monitor for the end of file and kill the stream?
Extra Info
In the actual code, there is also a transform that reports progress based on bytes processed. There are a few bytes processed after this item.
I am using unzipper to do the extract
The zip file is a publicly accessible SEC submissions.zip. I have no problem with companies.zip. (I'm trying to find their linkable page)
I download the zip in full before processing.

Out of frustration, I have implemented a Dead Man's Switch.
let deadman = null;
await new Promise((resolve)=>{
fs.createReadStream(zPath)
.pipe(unzip.Parse())
.pipe(Stream.Transform({
clearTimeout(deadman);
deadman = setTimeout(resolve,60000);
/// still do all the other stuff
}
}))
.on('finish',()=>{
clearTimeout(deadman);
console.log('DONE');
resolve();
})
});
Now, every time it processes an entry, it has 60 seconds to complete processing. If it fails to complete processing in 60 seconds, it is assumed to have died and the promise is resolved. The timer is restarted every time an item is processed (the stream demonstrates it is still alive).
While I do not consider this a solution, just a work around, it is intended to be used as a single process, so it can be terminated after the run (to clean up the memory)

Related

Do not process next job until previous job is completed (BullJS/Redis)?

Basically, each of the clients ---that have a clientId associated with them--- can push messages and it is important that a second message from the same client isn't processed until the first one is finished processing (Even though the client can send multiple messages in a row, and they are ordered, and multiple clients sending messages should ideally not interfere with each other). And, importantly, a job shouldn't be processed twice.
I thought that using Redis I might be able to fix this issue, I started with some quick prototyping using the bull library, but I am clearly not doing it well, I was hoping someone would know how to proceed.
This is what I tried so far:
Create jobs and add them to the same queue name for one process, using the clientId as the job name.
Consume jobs while waiting large random amounts of random time on 2 separate process.
I tried adding the default locking provided by the library that I am using (bull) but it locks on the jobId, which is unique for each job, not on the clientId .
What I would want to happen:
One of the consumers can't take the job from the same clientId until the previous one is finished processing it.
They should be able to, however, get items from different clientIds in parallel without problem (asynchronously). (I haven't gotten this far, I am right now simply dealing with only one clientId)
What I get:
Both consumers consume as many items as they can from the queue without waiting for the previous item for the clientId to be completed.
Is Redis even the right tool for this job?
Example code
// ./setup.ts
import Queue from 'bull';
import * as uuid from 'uuid';
// Check that when a message is taken from a place, no other message is taken
// TO do that test, have two processes that process messages and one that sets messages, and make the job take a long time
// queue for each room https://stackoverflow.com/questions/54178462/how-does-redis-pubsub-subscribe-mechanism-works/54243792#54243792
// https://groups.google.com/forum/#!topic/redis-db/R09u__3Jzfk
// Make a job not be called stalled, waiting enough time https://github.com/OptimalBits/bull/issues/210#issuecomment-190818353
export async function sleep(ms: number): Promise<void> {
return new Promise((resolve) => {
setTimeout(resolve, ms);
});
}
export interface JobData {
id: string;
v: number;
}
export const queue = new Queue<JobData>('messages', 'redis://127.0.0.1:6379');
queue.on('error', (err) => {
console.error('Uncaught error on queue.', err);
process.exit(1);
});
export function clientId(): string {
return uuid.v4();
}
export function randomWait(minms: number, maxms: number): Promise<void> {
const ms = Math.random() * (maxms - minms) + minms;
return sleep(ms);
}
// Make a job not be called stalled, waiting enough time https://github.com/OptimalBits/bull/issues/210#issuecomment-190818353
// eslint-disable-next-line #typescript-eslint/ban-ts-comment
//#ts-ignore
queue.LOCK_RENEW_TIME = 5 * 60 * 1000;
// ./create.ts
import { queue, randomWait } from './setup';
const MIN_WAIT = 300;
const MAX_WAIT = 1500;
async function createJobs(n = 10): Promise<void> {
await randomWait(MIN_WAIT, MAX_WAIT);
// always same Id
const clientId = Math.random() > 1 ? 'zero' : 'one';
for (let index = 0; index < n; index++) {
await randomWait(MIN_WAIT, MAX_WAIT);
const job = { id: clientId, v: index };
await queue.add(clientId, job).catch(console.error);
console.log('Added job', job);
}
}
export async function create(nIds = 10, nItems = 10): Promise<void> {
const jobs = [];
await randomWait(MIN_WAIT, MAX_WAIT);
for (let index = 0; index < nIds; index++) {
await randomWait(MIN_WAIT, MAX_WAIT);
jobs.push(createJobs(nItems));
await randomWait(MIN_WAIT, MAX_WAIT);
}
await randomWait(MIN_WAIT, MAX_WAIT);
await Promise.all(jobs)
process.exit();
}
(function mainCreate(): void {
create().catch((err) => {
console.error(err);
process.exit(1);
});
})();
// ./consume.ts
import { queue, randomWait, clientId } from './setup';
function startProcessor(minWait = 5000, maxWait = 10000): void {
queue
.process('*', 100, async (job) => {
console.log('LOCKING: ', job.lockKey());
await job.takeLock();
const name = job.name;
const processingId = clientId().split('-', 1)[0];
try {
console.log('START: ', processingId, '\tjobName:', name);
await randomWait(minWait, maxWait);
const data = job.data;
console.log('PROCESSING: ', processingId, '\tjobName:', name, '\tdata:', data);
await randomWait(minWait, maxWait);
console.log('PROCESSED: ', processingId, '\tjobName:', name, '\tdata:', data);
await randomWait(minWait, maxWait);
console.log('FINISHED: ', processingId, '\tjobName:', name, '\tdata:', data);
} catch (err) {
console.error(err);
} finally {
await job.releaseLock();
}
})
.catch(console.error); // Catches initialization
}
startProcessor();
This is run using 3 different processes, which you might call like this (Although I use different tabs for a clearer view of what is happening)
npx ts-node consume.ts &
npx ts-node consume.ts &
npx ts-node create.ts &

I'm not familir with node.js. But for Redis, I would try this,
Let's say you have client_1, client_2, they are all publisher of events.
You have three machines, consumer_1,consumer_2, consumer_3.
Establish a list of tasks in redis, eg, JOB_LIST.
Clients put(LPUSH) jobs into this JOB_LIST, in a specific form, like "CLIENT_1:[jobcontent]", "CLIENT_2:[jobcontent]"
Each consumer takes out jobs blockingly (RPOP command of Redis) and process them.
For example, consumer_1 takes out a job, content is CLIENT_1:[jobcontent]. It parses the content and recognize it's from CLIENT_1. Then it wants to check if some other consumer is processing CLIENT_1 already, if not, it will lock the key to indicate that it's processing CLIENT_1.
It goes on to set a key of "CLIENT_1_PROCESSING" , with content as "consumer_1", using the Redis SETNX command (set if the key not exists), with an appropriate timeout. For example, the task norally takes one minute to finish, you set a timeout of the key of five minutes, just in case consumer_1 crashes and holds on the lock indefinitely.
If the SETNX returns 0, it means it fails to acquire the lock of CLIENT_1 (someone is already processing a job of client_1). Then it returns the job (a value of "CLIENT_1:[jobcontent]")to the left side of JOB_LIST, by using Redis LPUSH command.Then it might wait a bit (sleep a few seconds), and RPOP another task from the right side of the LIST. If this time SETNX returns 1, consumer_1 acquires the lock. It goes on to process job, after it finishes, it deletes the key of "CLIENT_1_PROCESSING", releasing the lock. Then it goes on to RPOP another job, and so on.
Some things to consider:
The JOB_LIST is not fair,eg, earlier jobs might be processed later
The locking part is a bit rudimentary, but will suffice.
----------update--------------
I've figured another way to keep tasks in order.
For each client(producer), build a list. Like "client_1_list", push jobs into the left side of the list.
Save all the client names in a list "client_names_list", with values "client_1", "client_2", etc.
For each consumer(processor), iterate the "client_names_list", for example, consumer_1 get a "client_1", check if the key of client_1 is locked(some one is processing a task of client_1 already), if not, right pop a value(job) from client_1_list and lock client_1. If client_1 is locked, (probably sleep one second) and iterate to the next client, "client_2", for example, and check the keys and so on.
This way, each client(task producer)'s task is processed by their order of entering.

EDIT: I found the problem regarding BullJS is starting jobs in parallel on one processor: We are using named jobs and where defining many named process functions on one queue/processor. The default concurrency factor for a queue/processor is 1. So the queue should not process any jobs in parallel.
The problem with our mentioned setup is if you define many (named) process-handlers on one queue the concurrency is added up with each process-handler function: So if you define three named process-handlers you get a concurrency factor of 3 for given queue for all the defined named jobs.
So just define one named job per queue for queues where parallel processing should not happen and all jobs should run sequentially one after the other.
That could be important e.g. when pushing a high number of jobs onto the queue and the processing involves API calls that would give errors if handled in parallel.
The following text is my first approach of answering the op's question and describes just a workaround to the problem. So better just go with my edit :) and configure your queues the right way.
I found an easy solution to operators question.
In fact BullJS is processing many jobs in parallel on one worker instance:
Let's say you have one worker instance up and running and push 10 jobs onto the queue than possibly that worker starts all processes in parallel.
My research on BullJS-queues gave that this is not intended behavior: One worker (also called processor by BullJS) should only start a new job from the queue when its in idle state so not processing a former job.
Nevertheless BullJS keeps starting jobs in parallel on one worker.
In our implementation that lead to big problems during API calls that most likely are caused by t00 many API calls at a time. Tests gave that when only starting one worker the API calls finished just fine and gave status 200.
So how to just process one job after the other once the previous is finished if BullJS does not do that for us (just what the op asked)?
We first experimented with delays and other BullJS options but thats kind of workaround and not the exact solution to the problem we are looking for. At least we did not get it working to stop BullJS from processing more than one job at a time.
So we did it ourself and started one job after the other.
The solution was rather simple for our use case after looking into BullJS API reference (BullJS API Ref).
We just used a for-loop to start the jobs one after another. The trick was to use BullJS's
job.finished
method to get a Promise.resolve once the job is finished. By using await inside the for-loop the next job gets just started immediately after the job.finished Promise is awaited (resolved). Thats the nice thing with for-loops: Await works in it!
Here a small code example on how to achieve the intended behavior:
for (let i = 0; i < theValues.length; i++) {
jobCounter++
const job = await this.processingQueue.add(
'update-values',
{
value: theValues[i],
},
{
// delay: i * 90000,
// lifo: true,
}
)
this.jobs[job.id] = {
jobType: 'socket',
jobSocketId: BackgroundJobTasks.UPDATE_VALUES,
data: {
value: theValues[i],
},
jobCount: theValues.length,
jobNumber: jobCounter,
cumulatedJobId
}
await job.finished()
.then((val) => {
console.log('job finished:: ', val)
})
}
The important part is really
await job.finished()
inside the for loop. leasingValues.length jobs get started all just one after the other as intended.
That way horizontally scaling jobs across more than one worker is not possible anymore. Nevertheless this workaround is okay for us at the moment.
I will get in contact with optimalbits - the maker of BullJS to clear things out.

Nodejs exec child process stdout not getting all the chunks

I'm trying to send messages from my child process to main my process but some chunks are not being sent, possibly because the file is too big.
main process:
let response = ''
let error = ''
await new Promise(resolve => {
const p = exec(command)
p.stdout.on('data', data => {
// this gets triggered many times because the html string is big and gets split up
response += data
})
p.stderr.on('data', data => {
error += data
})
p.on('exit', resolve)
})
console.log(response)
child process:
// only fetch 1 page, then quit
const bigHtmlString = await fetchHtmlString(url)
process.stdout.write(bigHtmlString)
I know the child process works because when I run the it directly, I can see the end of the file in in the console. But when I run the main process, I cannot see the end of the file. It's quite big so I'm not sure exactly what chunks are missing.
edit: there's also a new unknown problem. when I add a wait at the end of my child process, it doesn't wait, it closes. So I'm guessing it crashes somehow? I'm not seeing any error even with p.on('error', console.log)
example:
const bigHtmlString = await fetchHtmlString(url)
process.stdout.write(bigHtmlString)
// this never gets executed, the process closes. The wait works if I launch the child process directly
await new Promise(resolve => setTimeout(resolve, 1000000))

process.stdout.write(...) returns true/false depending on whether it wrote the string or not. If it returns false, you can listen() to the drain event to make sure it finishes.
Something like this:
const bigHtmlString = await fetchHtmlString(url);
const wrote = process.stdout.write(bigHtmlString);
if (!wrote){
// this effectively means "wait for this
// event to fire", but it doesn't block everything
process.stdout.on('drain', ...doSomethingHere)
}

My suggestion from the comments resolved the issue so I'm posting it as an answer.
I would suggest using spawn instead of exec. The latter buffers the output and flushes it when the process is ended (or the buffer is full) while spawn is streaming the output which is better for huge output like in your case

Nodejs: How can I optimize writing many files?

I'm working in a Node environment on Windows. My code is receiving 30 Buffer objects (~500-900kb each) each second, and I need to save this data to the file system as quickly as possible, without engaging in any work that blocks the receipt of the following Buffer (i.e. the goal is to save the data from every buffer, for ~30-45 minutes). For what it's worth, the data is sequential depth frames from a Kinect sensor.
My question is: What is the most performant way to write files in Node?
Here's pseudocode:
let num = 0
async function writeFile(filename, data) {
fs.writeFileSync(filename, data)
}
// This fires 30 times/sec and runs for 30-45 min
dataSender.on('gotData', function(data){
let filename = 'file-' + num++
// Do anything with data here to optimize write?
writeFile(filename, data)
}
fs.writeFileSync seems much faster than fs.writeFile, which is why I'm using that above. But are there any other ways to operate on the data or write to file that could speed up each save?

First off, you never want to use fs.writefileSync() in handling real-time requests because that blocks the entire node.js event loop until the file write is done.
OK, based on writing each block of data to a different file, then you want to allow multiple disk writes to be in process at the same time, but not unlimited disk writes. So, it's still appropriate to use a queue, but this time the queue doesn't just have one write in process at a time, it has some number of writes in process at the same time:
const EventEmitter = require('events');
class Queue extends EventEmitter {
constructor(basePath, baseIndex, concurrent = 5) {
this.q = [];
this.paused = false;
this.inFlightCntr = 0;
this.fileCntr = baseIndex;
this.maxConcurrent = concurrent;
}
// add item to the queue and write (if not already writing)
add(data) {
this.q.push(data);
write();
}
// write next block from the queue (if not already writing)
write() {
while (!paused && this.q.length && this.inFlightCntr < this.maxConcurrent) {
this.inFlightCntr++;
let buf = this.q.shift();
try {
fs.writeFile(basePath + this.fileCntr++, buf, err => {
this.inFlightCntr--;
if (err) {
this.err(err);
} else {
// write more data
this.write();
}
});
} catch(e) {
this.err(e);
}
}
}
err(e) {
this.pause();
this.emit('error', e)
}
pause() {
this.paused = true;
}
resume() {
this.paused = false;
this.write();
}
}
let q = new Queue("file-", 0, 5);
// This fires 30 times/sec and runs for 30-45 min
dataSender.on('gotData', function(data){
q.add(data);
}
q.on('error', function(e) {
// go some sort of write error here
console.log(e);
});
Things to consider:
Experiment with the concurrent value you pass to the Queue constructor. Start with a value of 5. Then see if raising that value any higher gives you better or worse performance. The node.js file I/O subsystem uses a thread pool to implement asynchronous disk writes so there is a max number of concurrent writes that will allow so cranking the concurrent number up really high probably does not make things go faster.
You can experiement with increasing the size of the disk I/O thread pool by setting the UV_THREADPOOL_SIZE environment variable before you start your node.js app.
Your biggest friend here is disk write speed. So, make sure you have a fast disk with a good disk controller. A fast SSD on a fast bus would be best.
If you can spread the writes out across multiple actual physical disks, that will likely also increase write throughput (more disk heads at work).
This is a prior answer based on the initial interpretation of the question (before editing that changed it).
Since it appears you need to do your disk writes in order (all to the same file), then I'd suggest that you either use a write stream and let the stream object serialize and cache the data for you or you can create a queue yourself like this:
const EventEmitter = require('events');
class Queue extends EventEmitter {
// takes an already opened file handle
constructor(fileHandle) {
this.f = fileHandle;
this.q = [];
this.nowWriting = false;
this.paused = false;
}
// add item to the queue and write (if not already writing)
add(data) {
this.q.push(data);
write();
}
// write next block from the queue (if not already writing)
write() {
if (!nowWriting && !paused && this.q.length) {
this.nowWriting = true;
let buf = this.q.shift();
fs.write(this.f, buf, (err, bytesWritten) => {
this.nowWriting = false;
if (err) {
this.pause();
this.emit('error', err);
} else {
// write next block
this.write();
}
});
}
}
pause() {
this.paused = true;
}
resume() {
this.paused = false;
this.write();
}
}
// pass an already opened file handle
let q = new Queue(fileHandle);
// This fires 30 times/sec and runs for 30-45 min
dataSender.on('gotData', function(data){
q.add(data);
}
q.on('error', function(err) {
// got disk write error here
});
You could use a writeStream instead of this custom Queue class, but the problem with that is that the writeStream may fill up and then you'd have to have a separate buffer as a place to put the data anyway. Using your own custom queue like above takes care of both issues at once.
Other Scalability/Performance Comments
Because you appear to be writing the data serially to the same file, your disk writing won't benefit from clustering or running multiple operations in parallel because they basically have to be serialized.
If your node.js server has other things to do besides just doing these writes, there might be a slight advantage (would have to be verified with testing) to creating a second node.js process and doing all the disk writing in that other process. Your main node.js process would receive the data and then pass it to the child process that would maintain the queue and do the writing.
Another thing you could experiment with is coalescing writes. When you have more than one item in the queue, you could combine them together into a single write. If the writes are already sizable, this probably doesn't make much difference, but if the writes were small this could make a big difference (combining lots of small disk writes into one larger write is usually more efficient).
Your biggest friend here is disk write speed. So, make sure you have a fast disk with a good disk controller. A fast SSD would be best.

I have written a service that does this extensively and the best thing you can do is to pipe the input data directly to the file (if you have an input stream as well).
A simple example where you download a file in such a way:
const http = require('http')
const ostream = fs.createWriteStream('./output')
http.get('http://nodejs.org/dist/index.json', (res) => {
res.pipe(ostream)
})
.on('error', (e) => {
console.error(`Got error: ${e.message}`);
})
So in this example there is no intermediate copying involved of the whole file. As the file is read in chunks from the remote http server it is written to the file on disk. This is much more efficient that downloading a whole file from the server, saving that in memory and then writing it to a file on disk.
Streams are a basis of many operations in Node.js so you should study those as well.
One other thing that you should investigate depending on your scenarios is UV_THREADPOOL_SIZE as I/O operations use libuv thread pool that is by default set to 4 and you might fill that up if you do a lot of writing.

How to manage a queue in nodejs?

I have written a script in Nodejs that takes a screenshot of websites(using slimerJs), this script takes around 10-20 seconds to complete, the problem here is the server is stalled until this script has is finished.
app.get('/screenshot', function (req, res, next) {
var url = req.query.url;
assert(url, "query param 'url' needed");
// actual saving happens here
var fileName = URL.parse(url).hostname + '_' + Date.now() + '.png';
var command = 'xvfb-run -a -n 5 node slimerScript.js '+ url + ' '+ fileName;
exec(command, function (err, stdout, stderror) {
if(err){ return next(err); }
if(stderror && (stderror.indexOf('error')!= -1) ){ return next(new Error('Error occurred!')); }
return res.send({
status: true,
data: {
fileName: fileName,
url: "http://"+path.join(req.headers.host,'screenshots', fileName)
}
});
})
});
Since the script spawns a firefox browser in memory and loads the website, the ram usage can spike upto 600-700mb, and thus i cannot execute this command asynchronously as ram is expensive on servers.
may i know if its possible to queue the incoming requests and executing them in FIFO fashion?
i tried checking packages like kue, bull and bee-queues, but i think these all assume the job list is already known before the queue is started, where as my job list depends on users using the site, and i wanna also tell people that they are in queue and need to wait for their turn. is this possible with the above mentioned packages?

If I were doing the similar thing, I would try these steps.
1.An array(a queue) to store requested info, when any request come, store those info in the array, and send back a msg to users, telling them they are in the queue, or the server is busy if there are already too many requests.
2.Doing the screen shot job, async, but not all in the same time. You could start the job if you find the queue is empty when a new request comes, and start another recursively when you finish the last one.
function doSceenShot(){
if(a.length > 1){
execTheJob((a[0])=>{
//after finishing the job;
doScreenShot()
})
}
}
3.Notify the user you've finished the job, via polling or other ways.

Reading file in segments of X number of lines

I have a file with a lot of entries (10+ million), each representing a partial document that is being saved to a mongo database (based on some criteria, non-trivial).
To avoid overloading the database (which is doing other operations at the same time), I wish to read in chunks of X lines, wait for them to finish, read the next X lines, etc.
Is there any way to use any of the fscallback-mechanisms to also "halt" progress at a certain point, without blocking the entire program? From what I can tell they will all run from start to finish with no way of stopping it, unless you stop reading the file entirely.
The issues is that because of the file size, memory also becomes an issue and because of the time the updates take, a LOT of the data will be held in memory exceeding the 1 GB limit and causing the program to crash. Secondarily, as I said, I don't want to queue 1 million updates and completely stress the mongo database.
Any and all suggestions welcome.
UPDATE: Final solution using line-reader (available via npm) below, in pseudo-code.
var lineReader = require('line-reader');
var filename = <wherever you get it from>;
lineReader(filename, function(line, last, cb) {
//
// Do work here, line contains the line data
// last is true if it's the last line in the file
//
function checkProcessed(callback) {
if (doneProcessing()) { // Implement doneProcessing to check whether whatever you are doing is done
callback();
}
else {
setTimeout(function() { checkProcessed(callback) }, 100); // Adjust timeout according to expecting time to process one line
}
}
checkProcessed(cb);
});
This is implemented to make sure doneProcessing() returns true before attempting to work on more lines - this means you can effectively throttle whatever you are doing.

I don't use MongoDB and I'm not an expert in using Lazy, but I think something like below might work or give you some ideas. (note that I have not tested this code)
var fs = require('fs'),
lazy = require('lazy');
var readStream = fs.createReadStream('yourfile.txt');
var file = lazy(readStream)
.lines // ask to read stream line by line
.take(100) // and read 100 lines at a time.
.join(function(onehundredlines){
readStream.pause(); // pause reading the stream
writeToMongoDB(onehundredLines, function(err){
// error checking goes here
// resume the stream 1 second after MongoDB finishes saving.
setTimeout(readStream.resume, 1000);
});
});
}

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string