I have a Node application which handles JSON files: it reads and parses files and writes new ones. Sometimes, out of necessity, the files become a massive swarm. The current reading speed looks reasonable to me, but the writing speed seems a little slow.
I'd like to improve this processing speed.
Before touching this program, I had tried multi-threading in a Python application of mine that does similar tasks but handles image files, and the threading successfully reduced its response time.
I wonder if it's okay to use Node's worker_threads to get the same effect, because the Node documentation says:
They do not help much with I/O-intensive work. The Node.js built-in asynchronous I/O operations are more efficient than Workers can be.
https://nodejs.org/api/worker_threads.html
The problem is that I don't know whether the current speed is the fastest the Node environment can deliver, or whether it can still be improved without worker_threads.
These are my attempts at improvement. My program reads and writes files one by one from a list of file paths, with the synchronous fs functions readFileSync() and writeFileSync(). First, I thought accessing many files synchronously is not node-ish, so I promisified the fs functions (readFile(), writeFile()) and pushed the results into a list of promise objects. Then I called await Promise.all(promisesList). But this didn't help at all. It was even slower.
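Roughly, the first attempt looked like this (a reconstructed sketch, not my exact code; filePaths and transform stand in for the real path list and processing, and the snippet runs inside an async function):

const fs = require('fs/promises');

// read, transform and rewrite every file, then await all of the work at once
const promisesList = filePaths.map(async (filePath) => {
  const raw = await fs.readFile(filePath, 'utf8');
  const result = transform(JSON.parse(raw));            // placeholder for the real processing
  await fs.writeFile(filePath + '.out', JSON.stringify(result));
});
await Promise.all(promisesList);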
For the second try, I gave up generating tons of promises and made a single promise. It kept watching the number of processed files and called resolve() when that number was equal to the total number of files.
const waiter = new Promise<boolean>((resolve, reject) => {
  // poll until every file has been processed
  const loop: () => void = () =>
    processedCount === fileLen ? resolve(true) : setTimeout(loop);
  loop();
});
I awaited only this promise, and this was the slowest of all.
Now I think this shows that "asynchronous" does not mean "parallel". So, am I misunderstanding the documentation's explanation? Should I use worker_threads to improve file I/O speed in this case? Or is there a better solution? Maybe the answer is not to use Node for this kind of processing; I'd love to, but today is Nov 25th, sadly...
The real bottleneck here will be the file system implementation. Running up multiple threads to read and / or write multiple files in parallel will give you some speedup, but you quickly run into the file system bottleneck.
As a general rule, typical file systems do not handle the "gazillions of tiny files" use case well. And it gets progressively worse if the files are on slow disks, a remote file system, a hierarchical storage system, and so on.
The best solution is to redesign your application so that it doesn't organize its data like that. Better alternatives involve combinations of:
using an SQL or NOSQL database to store the data
using a flat-file database like SQLite or BDB (see the sketch after this list)
reading and writing TAR or ZIP archives
storing / buffering the data in memory.
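To make the SQLite option concrete, here is a minimal sketch using the third-party better-sqlite3 package (my choice for the example, not something prescribed above; allDocs is a placeholder for the parsed JSON objects):

const Database = require('better-sqlite3');

const db = new Database('data.db');
db.exec('CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, body TEXT)');

// one transaction with many rows instead of thousands of tiny files
const insert = db.prepare('INSERT OR REPLACE INTO docs (id, body) VALUES (?, ?)');
const insertMany = db.transaction((docs) => {
  for (const doc of docs) insert.run(doc.id, JSON.stringify(doc));
});

insertMany(allDocs);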
If you are trying to get better performance, Python is not where you should look. For a start, a CPU-bound multi-threaded application is effectively constrained to a single core ... due to the GIL. And your Python code is typically not compiled to native code.
A language like C, C++ or even Java would be a better choice. But parallelizing an application's file I/O is difficult and the results tend to be disappointing. It is generally better to do it a different way; i.e. avoid application architectures that use lots of little files.
Have you tried the Node streams API? There is also the JSONStream npm package for parsing JSON stream data. Please have a look.
const fs = require('fs');

// copy the file by piping a read stream into a write stream
let sourceFileStream = fs.createReadStream('./file1.json');
let destinationFileStream = fs.createWriteStream('./temp/file1.json');

sourceFileStream.pipe(destinationFileStream);
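And if the individual JSON files are large, JSONStream lets you parse them piece by piece instead of reading them whole. A minimal sketch (the 'items.*' path is a placeholder for whatever structure your files actually have):

const fs = require('fs');
const JSONStream = require('JSONStream'); // the npm package mentioned above

fs.createReadStream('./file1.json')
  .pipe(JSONStream.parse('items.*'))      // emits each element of a top-level "items" array
  .on('data', (item) => {
    // process each object as soon as it is parsed
  })
  .on('end', () => console.log('done'));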
I'm using snowflake-sdk and snowflake-promise to stream results (to avoid loading too many objects in memory).
For each streamed row, I want to process the received information (an ETL-like job that performs write-backs). My code is quite basic and similar to this simplistic snowflake-promise example.
My current problem is that .on('data', ...) is called more often than I can manage to handle. (My ETL-like job can't keep up with the received rows and my DB connection pool to perform write-backs gets exhausted).
I tried setting rowStreamHighWaterMark to various values (1, 10 [default], 100, 1000, 2000 and 4000) in an effort to slow down/backpressure stream.Readable but, unfortunately, it didn't change anything.
What did I miss? How can I better control when to consume the read data?
If this were written synchronously, you would see that "being pushed more data than you can handle to write at the same time" cannot happen, because:
// pseudocode
while (data) {
  data.readRow();
  doSomethingAwesome();
  writeDataViaPoolThatBacksUp();
}
simply cannot spin too fast.
Now, if you are accepting data on one async thread, pushing that data onto a queue, and draining the queue on another async thread, you will get the problem you describe (that is, your queue explodes). So you need to slow or pause the reading side when the writing side falls too far behind.
Given that the reader is writing into the assumed queue, when that queue gets too long, stop (pause the read).
The other way you might be doing this is with no work queue, instead firing an async write each time the conditions are met. This is bad because you have no tracking of outstanding work, and you are doing many small updates to the DB, which Snowflake really dislikes. A better approach is to build a local set of data changes, call it a batch, and when the batch reaches a certain size, flush the change set in one operation (and flush the batch again when the input is complete, to catch the dregs), as sketched below.
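A minimal sketch of that batch-and-pause idea, assuming the row stream is a standard Readable (BATCH_SIZE and flushBatch are illustrative names, not part of the SDK):

const BATCH_SIZE = 1000;
let batch = [];

stream.on('data', (row) => {
  batch.push(row);
  if (batch.length >= BATCH_SIZE) {
    stream.pause();                        // backpressure: stop accepting rows for a while
    flushBatch(batch)                      // one bulk write-back instead of many small updates
      .then(() => {
        batch = [];
        stream.resume();                   // ready for the next rows
      })
      .catch((err) => stream.destroy(err));
  }
});

stream.on('end', () => {
  if (batch.length) flushBatch(batch);     // flush the dregs once the input is complete
});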
The Snowflake support got back to me with an answer.
They told me to create the connection this way:
var connection = snowflake.createConnection({
account: "testaccount",
username: "testusername",
password: "testpassword",
rowStreamHighWaterMark: 5
});
Full disclaimer: My project has changed and I could NOT recreate the problem on my local environment. I couldn't assess the answer's validity; still, I wanted to share in case somebody could get some hints from this information.
I have a readable Node.js stream which I want to use twice. Disclaimer: I'm not very comfortable with streams.
Why?
My service allows users to upload images. I want to avoid uploading the same images more than once.
My workflow is as follows:
upload image via ajax
get hash of image
if hash in database
return url from database
else
pass hash to resize&optimize pipeline
upload image to s3 bucket
get hash of image and write it to database with url
return s3 url
I get the hash of my stream with hashstream and optimize my image with gm.
Hashstream takes a stream, closes it, creates a hash and returns it with a callback.
My question is: What would be the best approach to combine both methods?
There are two ways to solve it:
Buffer the stream
Since you don't know if your stream will be used again, you can simply buffer it up somehow (meaning: handle the data events yourself, or use some module, for example accum). As soon as you know the outcome of the hash function, you'd simply write the whole accumulated buffer into the gm stream.
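A minimal sketch of the buffering idea using only core modules (bufferAndHash is just an illustrative helper; you could equally use accum or another module):

const crypto = require('crypto');

// accumulate the whole upload, hash it, and hand both back so the caller can decide
// whether to write the buffer into the gm pipeline or answer from the database
function bufferAndHash(source, callback) {
  const chunks = [];
  source.on('data', (chunk) => chunks.push(chunk));
  source.on('error', callback);
  source.on('end', () => {
    const buffer = Buffer.concat(chunks);
    const hash = crypto.createHash('sha1').update(buffer).digest('hex');
    callback(null, { buffer, hash });
  });
}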
Use stream.pipe twice to "tee"
You probably know the POSIX command tee; likewise, you can push all the data into two places. Here's an example implementation of a tee method in my "scramjet" stream, but I guess for you it would be quite sufficient to simply pipe twice. Then, as soon as your hash is calculated and you run into the first condition, simply send an end to the second branch.
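For illustration, a tiny tee built from core PassThrough streams (branch names are placeholders; hook them up to hashstream and the gm pipeline however suits you):

const { PassThrough } = require('stream');

// pipe the same source into two branches that can be consumed independently
function tee(source) {
  const a = new PassThrough();
  const b = new PassThrough();
  source.pipe(a);
  source.pipe(b);
  return [a, b];
}

// usage sketch: once the hash branch tells you the image is already known,
// destroy the optimize branch and answer with the stored URL instead.
// const [hashBranch, optimizeBranch] = tee(uploadStream);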
The right choice depends on whether you want to conserve memory or CPU. For less memory, use two pipes (your optimization process will start, but you'll cancel it before it outputs anything). For less CPU and fewer processes, go for buffering.
All in all, I would consider buffering only if you can easily scale to more incoming images or you know exactly how much load there is and can handle it. Either way there will be limits, and those limits need to be handled somehow; if you can spin up a couple more instances, then you are better off using more CPU and keeping memory at a sensible level.
I have a question about Node.js streams - specifically how they work conceptually.
There is no lack of documentation on how to use streams. But I've had difficulty finding how streams work at the data level.
My limited understanding of web communication, HTTP, is that full "packages" of data are sent back and forth. Similar to an individual ordering a company's catalogue, a client sends a GET (catalogue) request to the server, and the server responds with the catalogue. The browser doesn't receive a page of the catalogue, but the whole book.
Are node streams perhaps multipart messages?
I like the REST model - especially that it is stateless. Every single interaction between the browser and server is completely self contained and sufficient. Are node streams therefore not RESTful? One developer mentioned the similarity with socket pipes, which keep the connection open. Back to my catalogue ordering example, would this be like an infomercial with the line "But wait! There's more!" instead of the fully contained catalogue?
A large part of streams is the ability for the receiver 'down-stream' to send messages like 'pause' & 'continue' upstream. What do these messages consist of? Are they POST?
Finally, my limited visual understanding of how Node works includes this event loop. Functions can be placed on separate threads from the thread pool, and the event loop carries on. But shouldn't sending a stream of data keep the event loop occupied (i.e. stopped) until the stream is complete? How is it ALSO keeping watch for the 'pause' request from downstream? Does the event loop place the stream on another thread from the pool and, when it encounters a 'pause' request, retrieve the relevant thread and pause it?
I've read the node.js docs, completed the nodeschool tutorials, built a heroku app, purchased TWO books (real, self-contained books, kinda like the catalogues spoken of before and likely not like node streams), and asked several "node" instructors at code bootcamps - all speak about how to use streams but none speak about what's actually happening below.
Perhaps you have come across a good resource explaining how these work? Perhaps a good anthropomorphic analogy for a non CS mind?
The first thing to note is: node.js streams are not limited to HTTP requests. HTTP requests / Network resources are just one example of a stream in node.js.
Streams are useful for everything that can be processed in small chunks. They allow you to process potentially huge resources in smaller chunks that fit into your RAM more easily.
Say you have a file (several gigabytes in size) and want to convert all lowercase into uppercase characters and write the result to another file. The naive approach would read the whole file using fs.readFile (error handling omitted for brevity):
fs.readFile('my_huge_file', function (err, data) {
var convertedData = data.toString().toUpperCase();
fs.writeFile('my_converted_file', convertedData);
});
Unfortunately this approach will easily overwhelm your RAM, as the whole file has to be stored before processing it. You also waste precious time waiting for the file to be read. Wouldn't it make sense to process the file in smaller chunks? You could start processing as soon as you get the first bytes while waiting for the hard disk to provide the remaining data:
var readStream = fs.createReadStream('my_huge_file');
var writeStream = fs.createWriteStream('my_converted_file');
readStream.on('data', function (chunk) {
var convertedChunk = chunk.toString().toUpperCase();
writeStream.write(convertedChunk);
});
readStream.on('end', function () {
writeStream.end();
});
This approach is much better:
You will only deal with small parts of data that will easily fit into your RAM.
You can start processing as soon as the first bytes arrive and don't waste time doing nothing but waiting.
Once you open the stream node.js will open the file and start reading from it. Once the operating system passes some bytes to the thread that's reading the file it will be passed along to your application.
Coming back to the HTTP streams:
The first issue is valid here as well. It is possible that an attacker sends you large amounts of data to overwhelm your RAM and take down (DoS) your service.
However the second issue is even more important in this case:
The network may be very slow (think smartphones) and it may take a long time until everything is sent by the client. By using a stream you can start processing the request and cut response times.
On pausing the HTTP stream: This is not done at the HTTP level, but way lower. If you pause the stream node.js will simply stop reading from the underlying TCP socket.
What is happening then is up to the kernel. It may still buffer the incoming data, so it's ready for you once you finished your current work. It may also inform the sender at the TCP level that it should pause sending data. Applications don't need to deal with that. That is none of their business. In fact the sender application probably does not even realize that you are no longer actively reading!
So it's basically about being provided data as soon as it is available, but without overwhelming your resources. The underlying hard work is done either by the operating system (e.g. net, fs, http) or by the author of the stream you are using (e.g. zlib which is a Transform stream and usually bolted onto fs or net).
The chart below seems to be a pretty accurate 10,000-foot overview / diagram of the Node streams class.
It represents streams3, contributed by Chris Dickinson.
So first of all, what are streams?
Well, with streams we can process (meaning read and write) data piece by piece without completing the whole read or write operation, so we don't have to keep all the data in memory to perform these operations.
For example, when we read a file using streams, we read part of the data, do something with it, then free our memory, and repeat this until the entire file has been processed. Or think of YouTube or Netflix, which are both called streaming companies because they stream video using the same principle.
So instead of waiting until the entire video file loads, the processing is done piece by piece, in chunks, so that you can start watching even before the entire file has been downloaded. The principle here is not specific to Node.js; it is universal in computer science.
So as you can see, this makes streams the perfect candidate for handling large volumes of data such as video, or data that we receive piece by piece from an external source. Streaming also makes data processing more efficient in terms of memory, because there is no need to keep all the data in memory, and in terms of time, because we can start processing the data as it arrives rather than waiting until everything has arrived.
How they are implemented in Node.JS:
So in Node, there are four fundamental types of streams:
readable streams, writable streams, duplex streams, and transform streams. The readable and writable ones are the most important. Readable streams are the ones from which we can read, i.e. consume, data. Streams are everywhere in the core Node modules; for example, the data that comes in when an http server gets a request is actually a readable stream, so all the data sent with the request arrives piece by piece rather than in one large piece. Another example, from the file system: we can read a file piece by piece by using a read stream from the fs module, which can be quite useful for large text files.
Well, another important thing to note is that streams are actually instances of the EventEmitter class. Meaning that all streams can emit and listen to named events. In the case of readable streams, they can emit, and we can listen to many different events. But the most important two are the data and the end events. The data event is emitted when there is a new piece of data to consume, and the end event is emitted as soon as there is no more data to consume. And of course, we can then react to these events accordingly.
Finally, besides events, we also have important functions that we can use on streams. In the case of readable streams, the most important ones are the pipe and read functions. The super important pipe function basically allows us to plug streams together, passing data from one stream to another without having to worry much about events at all.
Next up, writeable streams are the ones to which we can write data. So basically, the opposite of readable streams. A great example is the http response that we can send back to the client and which is actually a writeable stream. So a stream that we can write data into. So when we want to send data, we have to write it somewhere, right? And that somewhere is a writeable stream, and that makes perfect sense, right?
For example, if we wanted to send a big video file to a client, we would do it just like Netflix or YouTube do. Now, about events: the most important ones are the drain and finish events. And the most important functions are the write and end functions.
About duplex streams: they're simply streams that are both readable and writable at the same time. These are a bit less common, but a good example would be a socket from the net module, which is basically just a communication channel between client and server that works in both directions and stays open once the connection has been established.
Finally, transform streams are duplex streams, so streams that are both readable and writeable, which at the same time can modify or transform the data as it is read or written. A good example of this one is the zlib core module to compress data which actually uses a transform stream.
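As a tiny illustration of a transform stream in practice (file names are placeholders), compressing a file with the zlib core module looks like this:

const fs = require('fs');
const zlib = require('zlib');

// readable -> transform -> writable: the file is compressed chunk by chunk,
// never loaded into memory as a whole
fs.createReadStream('input.txt')
  .pipe(zlib.createGzip())
  .pipe(fs.createWriteStream('input.txt.gz'));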
Node implements these http requests and responses as streams, and we can consume them using the events and functions that are available for each type of stream. We could of course also implement our own streams and then consume them using these same events and functions.
Now let's try an example:
const fs = require('fs');
const server = require('http').createServer();

server.on('request', (req, res) => {
  fs.readFile('./txt/long_file.txt', (err, data) => {
    if (err) return console.log(err);
    res.end(data);
  });
});

server.listen(8000, '127.0.0.1', () => {
  console.log('Listening on 127.0.0.1:8000');
});
Suppose long_file.txt contains 1000000K lines and each line contains more than 100 words, so this is a huge file with a big chunk of data. The problem in the above example is that by using the readFile() function, Node will load the entire file into memory, because only after loading the whole file into memory can Node send the data in the response object.
When the file is big, and when there are a ton of requests hitting your server, the Node process will very quickly run out of resources and your app will stop working; everything will crash.
Let's try to find a solution by using stream:
const fs = require('fs');
const server = require('http').createServer();

server.on('request', (req, res) => {
  const readable = fs.createReadStream('./txt/long_file.txt');

  readable.on('data', chunk => {
    res.write(chunk);
  });

  readable.on('end', () => {
    res.end();
  });

  readable.on('error', err => {
    console.log(err);
    res.statusCode = 500;
    res.end('File not found');
  });
});

server.listen(8000, '127.0.0.1', () => {
  console.log('Listening on 127.0.0.1:8000');
});
Well, in the above example with the stream, we are effectively streaming the file: we read one piece of the file, and as soon as it's available we send it right to the client using the write method of the response stream. Then, when the next piece is available, that piece is sent, and so on until the entire file has been read and streamed to the client.
When the stream has finished reading the data from the file, the end event is emitted, and we call res.end() to signal that no more data will be written to this writable stream.
With the above practice we solved the previous problem, but there is still a huge problem remaining with this example, which is called backpressure.
The problem is that our readable stream, the one we are using to read the file from disk, is much, much faster than sending the result over the network through the writable response stream. This overwhelms the response stream, which cannot handle the incoming data that quickly, and this problem is called backpressure.
The solution is to use the pipe method; it matches the speed at which data comes in to the speed at which the destination can consume it.
const fs = require('fs');
const server = require('http').createServer();

server.on('request', (req, res) => {
  const readable = fs.createReadStream('./txt/long_file.txt');
  readable.pipe(res); // pipe handles the backpressure for us
});

server.listen(8000, '127.0.0.1', () => {
  console.log('Listening on 127.0.0.1:8000');
});
I think you are overthinking how all this works and I like it.
What streams are good for
Streams are good for two things:
when an operation is slow and can give you partial results as it gets them. For example, reading a file: it is slow because HDDs are slow, and the file can be handed to you in parts as it is read. With streams you can take those parts of the file and start to process them right away.
they are also good for connecting programs (read: functions) together, just as on the command line you can pipe different programs together to produce the desired output. Example: cat file | grep word.
How they work under the hood...
Most of these operations that take time and can give you partial results as they arrive are not done in JavaScript itself; they are done by Node.js's native layer (libuv and the operating system), which only hands those results over to JS for you to work with.
To understand your HTTP example, you need to understand how HTTP works.
There are different encodings a web page can be sent with. In the beginning there was only one way, where the whole page was sent when it was requested. Now there are more efficient encodings. One of them is chunked, where parts of the web page are sent until the whole page has been sent. This is good because a web page can be processed as it is received; imagine a web browser that can start to render a website before the download is complete.
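A quick way to see chunked encoding in action (a sketch; the port and markup are arbitrary): when you call res.write() several times without setting a Content-Length, Node sends the response chunked, piece by piece.

const http = require('http');

http.createServer((req, res) => {
  res.write('<p>first part of the page...</p>');             // sent immediately as a chunk
  setTimeout(() => res.end('<p>...and the rest</p>'), 1000); // final chunk a second later
}).listen(8080);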
Your .pause and .continue questions
First, Node.js streams only work within the same Node.js program. Node.js streams can't interact with a stream in another server or even program.
That means that in the example below, Node.js can't talk to the webserver. It can't tell it to pause or resume.
Node.js <-> Network <-> Webserver
What really happens is that Node.js asks for a web page and starts to download it, and there is no way to stop that download, short of dropping the socket.
So, what really happens when you call .pause or .resume in Node.js?
Node starts buffering the incoming data until you are ready to consume it again. But the download never stopped.
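A small sketch of what that looks like from your code's point of view (getReadable and processSlowly are placeholders):

const readable = getReadable();

readable.on('data', (chunk) => {
  readable.pause();                 // stop receiving 'data' events; incoming data is buffered
  processSlowly(chunk, () => {
    readable.resume();              // buffered chunks are delivered and reading continues
  });
});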
Event Loop
I have a whole answer prepared to explain how the Event Loop works but I think it is better for you to watch this talk.
So I have a backend implementation in node.js which mainly contains a global array of JSON objects. The JSON objects are populated by user requests (POSTs). So the size of the global array increases proportionally with the number of users. The JSON objects inside the array are not identical. This is a really bad architecture to begin with, but I just went with what I knew and decided to learn on the fly.
I'm running this on an AWS micro instance with 6GB RAM.
How to purge this global array before it explodes?
Options that I have thought of:
At a periodic interval write the global array to a file and purge. Disadvantage here is that if there are any clients in the middle of a transaction, that transaction state is lost.
Restart the server every day and write the global array into a file at that time. Same disadvantage as above.
Follow 1 or 2, and for every incoming request, if the global array is empty, look for the corresponding JSON object in the file. This seems absolutely absurd and stupid.
Somehow I can't think of any other solution without having to completely rewrite the nodejs application. Can you guys think of any .. ? Will greatly appreciate any discussion on this.
I see that you are using memory as storage. If that is the case and your code is synchronous (you don't seem to use a database, so it might be), then solution 1 is actually correct. This is because JavaScript is single-threaded, which means that while one piece of code is running, another cannot run. There is no concurrency in JavaScript; it is only an illusion, because Node.js is sooooo fast.
So your cleaning code won't fire until the transaction is over. This is of course assuming that your code is synchronous (and from what I see it might be).
But still, there are like 150 reasons for not doing that. The most important is that you are reinventing the wheel! Let the database do the hard work for you. Using a proper database will save you all that trouble in the future. There are many possibilities: MySQL, PostgreSQL, MongoDB (my favourite), CouchDB and many, many others. It shouldn't matter at this point which one; just pick one.
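For example, with MongoDB (my favourite above) and the official mongodb driver, persisting each object could look roughly like this (connection string, database and collection names are placeholders; a real app would reuse one client instead of connecting per call):

const { MongoClient } = require('mongodb');

async function saveTransaction(doc) {
  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();
  try {
    await client.db('myapp').collection('transactions').insertOne(doc);
  } finally {
    await client.close();
  }
}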
I would suggest that you start saving your JSON to a non-relational DB like http://www.couchbase.com/.
Couchbase is extremely easy to set up and use, even in a cluster. It uses a simple key-value design, so saving data is as simple as:
couchbaseClient.set("someKey", "yourJSON")
then to retrieve your data:
data = couchbaseClient.get("someKey")
The system is also extremely fast and is used by OMGPOP for Draw Something. http://blog.couchbase.com/preparing-massive-growth-revisited