The textbook way to read a file line-by-line in NodeJS seems to be to call readline.createInterface, and then afterward attach event handlers for line and close.
There doesn't seem to be anything to "start" the reader. It just goes, and seems to work perfectly. How does it know when to start reading? How does it guarantee that those events, which don't exist yet, will always pick up every line in the file?
I always assumed it all just happened so fast that the event handlers get attached before the file can be opened from disk and read - but that doesn't really hold up.
For example, suppose I put some heavy CPU-consuming code after the lineReader has been created, but before the events are attached. It still seems to work, and the event still fires for each line. How did it "wait" until the heavy stuff was done before it started reading? If I don't attach the line event, then it runs anyway and the close event still fires, so it's not like it's waiting for the line event to be created.
var readline = require("readline");
var fs = require("fs");

var lineReader = readline.createInterface({
    input: fs.createReadStream("input.txt")
});
// EVENTS HAVE NOT BEEN CREATED YET
lineReader.on("line", line => { console.log(line); });
lineReader.on("close", () => { console.log("DONE"); });
This isn't specific to lineReader - it seems to be a common Node pattern - this is just the easiest example to define and run.
Internally, readline.createInterface() is creating a stream. Streams, by default, are paused. They switch into flowing mode in a number of ways; the one relevant here is when a data event listener is added.
And, inside of readline.createInterface(), a data event handler is added. That starts the stream flowing and it will start emitting data events which the readline code will parse into line events.
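You can see this for yourself with a tiny experiment. This is just a sketch - it assumes an input.txt file exists and uses the readableFlowing property that newer Node versions expose - but it shows that nothing flows until a data listener appears:

const fs = require("fs");

// A freshly created read stream has no mechanism consuming it yet.
const stream = fs.createReadStream("input.txt", { encoding: "utf8" });
console.log(stream.readableFlowing);   // null - not flowing

// Attaching a "data" listener is one of the things that switches the
// stream into flowing mode - exactly what readline does internally.
stream.on("data", chunk => {
  console.log("got %d characters", chunk.length);
});
console.log(stream.readableFlowing);   // true - now flowing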
Also, because node.js and streams are event driven and node.js runs your Javascript single-threaded, no events will occur until your setup code finishes executing. Internally, node.js may have already started reading the file (using asynchronous I/O and threads internally), but even if it finishes the first read from the file before your setup code finishes executing, all it will do is insert a data event into the event queue. node.js won't process that data event until your setup code is done executing and has returned control back to the node.js event loop.
Then, the data event callback will be called, the readline code will parse the data from that first event and if there is a full line in that first data event, it will then trigger a line event.
There doesn't seem to be anything to "start" the reader.
Attaching a data event handler on the readStream (internal to the readline code) is what tells the stream to start flowing.
It just goes, and seems to work perfectly. How does it know when to start reading?
Same as above.
How does it guarantee that those events, which don't exist yet, will always pick up every line in the file?
The readline code receives raw data from the file in its data event handler. It then parses that data into lines and emits line events for each line that it finds. When a file read crosses a line boundary, it must buffer a partial line and wait for the rest of the line to come in the next data event from the stream.
When the readline code sees that the stream is done reading and there are no more bytes, it emits the last line (if there is one in the buffer) and then issues the close event to tell the listener that it's all done.
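As a rough illustration of that buffering, here is a simplified sketch (not the actual readline source - emitLine and leftover are made-up names for the example):

// Each "data" chunk is appended to whatever partial line was left over,
// complete lines are emitted, and the trailing fragment is kept for later.
let leftover = "";

function onData(chunk) {
  const parts = (leftover + chunk).split("\n");
  leftover = parts.pop();              // partial line - wait for more data
  for (const line of parts) emitLine(line);
}

function onEnd() {
  if (leftover.length > 0) emitLine(leftover);   // flush the final partial line
  console.log("close");                          // stands in for the close event
}

function emitLine(line) { console.log("line:", line); }

// Example: two chunks that split a line across the boundary.
onData("first line\nsecond li");
onData("ne\nthird line\n");
onEnd();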
For example, suppose I put some heavy CPU-consuming code after the lineReader has been created, but before the events are attached. It still seems to work, and the event still fires for each line. How did it "wait" until the heavy stuff was done before it started reading?
This is because node.js is event-driven. The first data event from the stream (internal to the readline code) is the result of an asynchronous file read that notifies completion through the event queue. An event in the event queue will not be processed until the current piece of Javascript finishes and returns control back to the event loop (at which point it will then service the next event waiting in the event queue). So, no matter how much CPU-consuming code you have before you attach the event handlers, the internals of readline won't be told about the first data read from the file until all that is done.
It is this single-threaded, event-driven nature that ensures that you get to install your event listeners before those events can be triggered so there's no way you can miss them.
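You can convince yourself of this by deliberately burning CPU between createInterface() and the .on() calls. This sketch assumes an input.txt file; even though the underlying reads may complete during the busy loop, their events just sit in the event queue until the synchronous code returns, so no lines are missed:

const fs = require("fs");
const readline = require("readline");

const lineReader = readline.createInterface({
  input: fs.createReadStream("input.txt")
});

// Burn ~2 seconds of CPU before attaching any listeners.
const stop = Date.now() + 2000;
while (Date.now() < stop) { /* busy wait */ }

lineReader.on("line", line => console.log("line:", line));
lineReader.on("close", () => console.log("DONE"));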
If I don't attach the line event, then it runs anyway and the close event still fires, so it's not like it's waiting for the line event to be created.
Correct. The readline code attaches the data event handler inside the createInterface() call, whether you have a line event listener or not. So, the stream will start flowing and the file will get read whether you have a line event handler or not.
FYI, one way you can answer these questions yourself is to just go look at the node.js code and see how it works. That's what I did here. Here's a link to the createInterface() function where you can see what I've described here.
And, you can see here in the stream doc where it describes the three ways that a stream starts flowing, one of which is attaching a data event listener.
I have a JSON body sent in a request that should get manipulated. But the two req.on handlers, which collect and end the data, get executed (as the command prompt shows) after the rest of the function they are in, so I don't get my JSON and the rest of the program throws an error.
I'm familiar with asynchronicity but have no idea how to get these two straight:
var StringDecoder = require('string_decoder').StringDecoder;
var decoder = new StringDecoder('utf8');
var buffer = '';

req.on('data', function(data){
    buffer += decoder.write(data);
});
req.on('end', function(data){
    buffer += decoder.end(data);
});
Been there for four hours and haven't found the solution.
Yes, the data and end events are asynchronous. That means they come some time later, long after you've registered the event handlers. And, .on() is non-blocking so it registers the event handler and then immediately returns and code placed after your calls to .on() will execute right after the event handlers are installed. The callbacks for those events will get called sometime in the future, if/when that event occurs.
Code that you want to run when all the data has been collected must be inserted into the end event handler or put in a function that you call from there.
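For example, something along these lines - just a sketch of the shape, where handleJson is a made-up name for whatever you need to do with the complete body:

const http = require("http");
const { StringDecoder } = require("string_decoder");

http.createServer((req, res) => {
  const decoder = new StringDecoder("utf8");
  let buffer = "";

  req.on("data", data => {
    buffer += decoder.write(data);       // collect chunks as they arrive
  });

  req.on("end", () => {
    buffer += decoder.end();
    handleJson(buffer, res);             // the full body is only available here
  });
}).listen(8080);

// Made-up helper: everything that needs the complete body lives here.
function handleJson(body, res) {
  let parsed;
  try {
    parsed = JSON.parse(body);
  } catch (err) {
    res.writeHead(400);
    return res.end("invalid JSON");
  }
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify(parsed));
}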
Here's an analogy. You go to a restaurant to eat, it's full and has a waiting list.
So, you give them your phone number and they say they will text you when your table is ready. That's like you registering for the data event.
Meanwhile, you go walk down the street for a while and window shop to kill some time. That's the rest of your code after the .on() calls executing. You run into a friend who says he wants to hang out after you're done with dinner. He gives you his number and asks you to text him when you're done with dinner. This is like when you register a handler for the end event.
You continue walking around downtown for a while longer. Then, sometime later, you receive the text message that your table is ready. That's like the callback for the data event getting called to tell you there is data ready.
You eat dinner and when you finish dinner, you text your friend that you're done. That's like the end event.
Notice that interwoven between these various events, other regular things get to run (walking around town, talking to your friend, eating dinner, etc...). This is like your other code that runs while waiting for those events to get called.
Code that wants to run at very specific times, such as when one of those events occurs, must be in the specific event handler for that event.
I'm rather new to event based programming. I'm experimenting with epoll's edge-mode which apparently only signals files which have become ready for read/write (as opposed to level-mode which signals all ready files, regardless of whether they were already ready, or just became ready).
What's not clear to me, is: in edge-mode, am I informed of readiness events that happen while I'm not epoll_waiting? What about events on one-shot files that haven't been rearmed yet?
To illustrate why I'm asking that, consider the following scenario:
have 10 non-blocking sockets connected
configure epoll_ctl to react when the sockets are ready for read, in edge-mode + oneshot: EPOLLET | EPOLLONESHOT | EPOLLIN
epoll_wait for something to happen (reports max 10 events)
Linux wakes my process and reports that sockets #1 and #2 are ready
I read and process data from socket #1 (until EAGAIN)
I read and process data from socket #2 (until EAGAIN)
While I'm doing that, a socket S receives data
I processed all events, so I rearm the triggered files with epoll_ctl in EPOLL_CTL_MOD mode, because of oneshot
my loop goes back to epoll_waiting the next batch of events
Ok, so will the last epoll_wait always be notified of the readiness of socket S? Even if S is #1 (i.e. it's not rearmed)?
I'm experimenting with epoll's edge-mode which apparently only signals
files which have become ready for read/write (as opposed to level-mode
which signals all ready files, regardless of whether they were
already ready, or just became ready)
First let's get a clear view of the system, you need an accurate mental model of how the system works. Your view of epoll(7) is not really accurate.
The difference between edge-triggered and level-triggered is the definition of what exactly makes an event. The former generates one event each time new activity happens on the subscribed file descriptor; once you consume the event, it is gone - even if you didn't consume all the data that generated that event. OTOH, the latter keeps generating the same event over and over until you consume all the data that generated the event.
Here's an example that puts these concepts in action, blatantly stolen from man 7 epoll:
1. The file descriptor that represents the read side of a pipe (rfd) is registered on the epoll instance.
2. A pipe writer writes 2 kB of data on the write side of the pipe.
3. A call to epoll_wait(2) is done that will return rfd as a ready file descriptor.
4. The pipe reader reads 1 kB of data from rfd.
5. A call to epoll_wait(2) is done.
If the rfd file descriptor has been added to the epoll interface using
the EPOLLET (edge-triggered) flag, the call to epoll_wait(2) done in
step 5 will probably hang despite the available data still present in
the file input buffer; meanwhile the remote peer might be expecting a
response based on the data it already sent. The reason for this is
that edge-triggered mode delivers events only when changes occur on
the monitored file descriptor. So, in step 5 the caller might end up
waiting for some data that is already present inside the input buffer.
In the above example, an event on rfd will be generated because of the
write done in 2 and the event is consumed in 3. Since the read
operation done in 4 does not consume the whole buffer data, the call
to epoll_wait(2) done in step 5 might block indefinitely.
In short, the fundamental difference is in the definition of "event": edge-triggered treats events as a single unit that you consume once; level-triggered defines the consumption of an event as being equivalent to consuming all of the data belonging to that event.
Now, with that out of the way, let's address your specific questions.
in edge-mode, am I informed of readiness events that happen while I'm
not epoll_waiting
Yes, you are. Internally, the kernel queues up the interesting events that happened on each file descriptor. They are returned on the next call to epoll_wait(2), so you can rest assured that you won't lose events. Well, maybe not exactly on the next call if there are other events pending and the events buffer passed to epoll_wait(2) can't accommodate them all, but the point is, eventually these events will be reported.
What about events on one-shot files that haven't been rearmed yet?
Again, you never lose events. If the file descriptor hasn't been rearmed yet, should any interesting event arise, it is simply queued in memory until the file descriptor is rearmed. Once it is rearmed, any pending events - including those that happened before the descriptor was rearmed - will be reported in the next call to epoll_wait(2) (again, maybe not exactly the next one, but they will be reported). In other words, EPOLLONESHOT does not disable event monitoring, it simply disables event notification temporarily.
Ok, so will the last epoll_wait always be notified of the readiness of
socket S? Even if S is #1 (i.e. it's not rearmed)?
Given what I said above, by now it should be pretty clear: yes, it will. You won't lose any event. epoll offers strong guarantees, it's awesome. It's also thread-safe and you can wait on the same epoll fd in different threads and update event subscription concurrently. epoll is very powerful, and it is well worth taking the time to learn it!
I have a ClearTextStream for a TLS connection and I want to check if "end" was already called. The actual problem is, that I'm trying to write something into the stream and I get an "write after end" error.
Now to avoid that, I just want to check if "end" was already called. I do have a "close" event handler, but it isn't fired in all cases.
I can't find it in the documentation and I couldn't find anything like that by googling.
I could check the error event (which is throwing "write after end" for me) and handle the situation there - but is there really no way to check this in the beginning?
Thanks!
If you get a write after end error, that means that you are trying to write data to a Writable stream that has been closed (i.e. that can't accept any more input data). When a writable stream closes, the finish event is emitted (see the documentation). On the other hand, the close event is emitted by a Readable stream when the underlying resource is closed (for instance when the file descriptor you are reading from is closed).
As a ClearTextStream is a Duplex stream, it can emit both close and finish events, but they don't mean the same thing. In your particular case, you should listen to the finish event and react appropriately.
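A minimal sketch of that approach - here a PassThrough stands in for your ClearTextStream so the example runs on its own, and the finished flag is just a local variable you maintain yourself:

const { PassThrough } = require("stream");

const stream = new PassThrough();        // stands in for the ClearTextStream

let finished = false;
stream.on("finish", () => {
  finished = true;                       // writable side has ended and flushed
});

function safeWrite(data) {
  if (finished) {
    console.log("stream already ended, dropping write");
    return false;
  }
  return stream.write(data);
}

safeWrite("hello");                      // ok
stream.end();                            // no more writes accepted after this
stream.on("finish", () => {
  safeWrite("world");                    // dropped instead of throwing "write after end"
});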
Another solution would be to check the this.ended and this.finished booleans (see the source code), but I wouldn't recommend that as they are private variables and only reflect the implementation details, not the public API.
I've often heard of Streams2 and old-streams, but what is Streams3? It gets mentioned in this talk by Thorsten Lorenz.
Where can I read about it, and what is the difference between Streams2 and Streams3?
Doing a search on Google, I also see it mentioned in the Changelog of Node 0.11.5,
stream: Simplify flowing, passive data listening (streams3) (isaacs)
I'm going to give this a shot, but I've probably got it wrong. Having never written Streams1 (old-streams) or Streams2, I'm probably not the right guy to self-answer this one, but here it goes. It seems as if there is a Streams1 API that still persists to some degree. In Streams2, there are two modes of streams: flowing (legacy) and non-flowing. In short, the shim that supported flowing mode is going away. This was the message that led to the patch now called Streams3,
Same API as streams2, but remove the confusing modality of flowing/old
mode switch.
Every time read() is called, and returns some data, a data event fires.
resume() will make it call read() repeatedly. Otherwise, no change.
pause() will make it stop calling read() repeatedly.
pipe(dest) and on('data', fn) will automatically call resume().
No switches into old-mode. There's only flowing, and paused. Streams start out paused.
Unfortunately, to understand that description, which defines Streams3 pretty well, you first need to understand Streams1 and the legacy streams.
Backstory
First, let's take a look at what the Node v0.10.25 docs say about the two modes,
Readable streams have two "modes": a flowing mode and a non-flowing mode. When in flowing mode, data is read from the underlying system and provided to your program as fast as possible. In non-flowing mode, you must explicitly call stream.read() to get chunks of data out. — Node v0.10.25 Docs
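For reference, this is what explicit, non-flowing consumption looks like in practice (a sketch that assumes a data.txt file): you wait for the stream to say it has data and pull chunks out yourself with read().

const fs = require("fs");

const stream = fs.createReadStream("data.txt", { encoding: "utf8" });

// Non-flowing (paused) mode: nothing is handed to you; you pull chunks
// out explicitly whenever the stream signals that data is available.
stream.on("readable", () => {
  let chunk;
  while ((chunk = stream.read()) !== null) {
    console.log("pulled %d characters", chunk.length);
  }
});

stream.on("end", () => console.log("no more data"));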
Isaac Z. Schlueter said in November slides I dug up:
streams2
"suck streams"
Instead of 'data' events spewing, call read() to pull data from source
Solves all problems (that we know of)
So it seems as if in streams1, you'd create an object and call .on('data', cb) on that object. This would set the event to be triggered, and then you were at the mercy of the stream. In Streams2, streams internally have buffers and you request data from those streams explicitly (using .read()). Isaac goes on to specify how backwards compat works in Streams2 to keep Streams1 (old-stream) modules functioning
old-mode streams1 shim
New streams can switch into old-mode, where they spew 'data'
If you add a 'data' event handler, or call pause() or resume(), then switch
Making minimal changes to existing tests to keep us honest
So in Streams2, a call to .pause() or .resume() triggers the shim. And, it should, right? In Streams2 you have control over when to .read(), and you're not catching stuff being thrown at you. This triggered a legacy mode that acted independently of Streams2.
Let's take an example from Isaac's slide,
createServer(function(q, s) {
    // ADVISORY only!
    q.pause()
    session(q, function(ses) {
        q.on('data', handler)
        q.resume()
    })
})
In Streams1, q starts up right away, reading and emitting (likely losing data), until the call to q.pause() advises q to stop pulling in data - though it doesn't stop it from emitting events to clear what it has already read.
In Streams2, q starts off paused until the call to .pause(), which signals it to emulate the old mode.
In Streams3, q starts off paused, having never read from the file handle, which makes q.pause() a no-op; the call to q.on('data', cb) will call q.resume() until there is no more data in the buffer, and then call q.resume() again, doing the same thing.
Seems like Streams3 was introduced in io.js, then in Node 0.11+
Streams 1 supported data being pushed to a stream. There was no consumer control; data was thrown at the consumer whether it was ready or not.
Streams 2 allows data to be pushed to a stream as per Streams 1, or for a consumer to pull data from a stream as needed. The consumer could control the flow of data in pull mode (using stream.read() when notified of available data). The stream cannot support both push and pull at the same time.
Streams 3 allows pull and push data on the same stream.
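As a rough illustration of what that means (a sketch assuming a data.txt file - mixing styles like this isn't something you'd normally do, it just shows that streams3 permits push and pull on the same stream without switching into a legacy mode):

const fs = require("fs");

const stream = fs.createReadStream("data.txt", { encoding: "utf8" });

stream.on("data", chunk => {               // push: data is delivered to you
  console.log("pushed %d characters", chunk.length);
  stream.pause();                          // stop the flow for a moment...

  setTimeout(() => {
    const extra = stream.read();           // ...pull: ask for data explicitly
    if (extra !== null) console.log("pulled %d characters", extra.length);
    stream.resume();                       // back to flowing mode
  }, 100);
});

stream.on("end", () => console.log("done"));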
Great overview here:
https://strongloop.com/strongblog/whats-new-io-js-beta-streams3/
A cached version (accessed 8/2020) is here: https://hackerfall.com/story/whats-new-in-iojs-10-beta-streams-3
I suggest you read the documentation, more specifically the section "API for Stream Consumers"; it's actually very understandable. Besides, I think the other answer is wrong: http://nodejs.org/api/stream.html#stream_readable_read_size
First of all, I am a beginner trying to understand what Node.js is. I have two questions.
First Question
Felix's article says "there can only be one callback firing at the same time. Until that callback has finished executing, all other callbacks have to wait in line".
Then, consider the following code (copied from the nodejs official website):
var http = require('http');
http.createServer(function (req, res) {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end('Hello World\n');
}).listen(8124, "127.0.0.1");
If two client requests are received simultaneously, it means the following workflow:
First http request event received, Second request event received.
As soon as the first event is received, the callback function for the first event starts executing.
Meanwhile, the callback function for the second event has to wait.
Am I right? If I am right, how does Node.js cope if there are thousands of client requests within a very short time?
Second Question
The term "Event Loop" is mostly used in Node.js topic. I have understood "Event Loop" as the following from http://www.wisegeek.com/what-is-an-event-loop.htm;
An event loop - or main loop, is a construct within programs that
controls and dispatches events following an initial event.
The initial event can be anything, including pushing a button on a
keyboard or clicking a button on a program (in Node.js, I think the
initial events will be http request, db queries or I/O file access).
This is called a loop, not because the event circles and happens
continuously, but because the loop prepares for an event, checks the
event, dispatches an event and repeats the process all over again.
I have a conflict about the second paragraph, especially the phrase "repeats the process all over again". I accept that the http.createServer code from the question above is absolutely an "event loop" because it repeatedly listens for http request events.
But I don't know whether to identify the following code as event-driven or as an event loop. It does not repeat anything except the callback function fired after the db query is finished.
database.query("SELECT * FROM table", function(rows) {
    var result = rows;
});
Please, let me hear your opinions and answers.
Answer one: your logic is correct, the second event will wait, and its callback will execute when its turn in the queue comes.
Also, remember that there is no such thing as "simultaneously" in the technical world. Everything has a very specific place and time.
The way node.js manages thousands of connections is that there is no need to hold a thread idling while some database call blocks the logic, or while another I/O operation is in progress (streams, for example). It can "serve" the first request, maybe creating more callbacks, and proceed to the others.
Because there is no way to block the execution (except nonsense like while(true) and similar), it becomes extremely efficient at spreading the actual resources all over the application logic.
Threads are expensive, and the number of threads a server can support is directly related to the available memory. So most classic web applications would suffer just because RAM is used on threads that are simply idling while a database query or some similar blocking operation is going on. In node that's not the case.
Still, it allows you to create multiple processes (via child_process) through cluster, which expands the possibilities even more.
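For completeness, here's a minimal sketch of the cluster idea - one worker process per CPU core, each with its own event loop (this uses the classic cluster API; port 8124 is arbitrary):

const cluster = require("cluster");
const http = require("http");
const os = require("os");

if (cluster.isMaster) {
  // Fork one worker process per CPU core.
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
} else {
  // Each worker runs its own event loop and shares the listening socket.
  http.createServer((req, res) => {
    res.end("handled by worker " + process.pid + "\n");
  }).listen(8124);
}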
Answer two. There is no such "loop" as the one you might be thinking about. There is no loop behind the scenes that checks whether there are connections or any data has been received, and so on. That is nowadays handled by async methods as well.
So from the application's point of view, there is no 'main loop', and everything from the developer's point of view is event-driven (not an event loop).
In the case of http.createServer, you bind a callback as the response to requests. All socket operations and I/O stuff happen behind the scenes, as well as HTTP handshaking, parsing headers, queries, parameters, and so on. Once that behind-the-scenes work is done, it keeps the data and pushes your callback onto the event queue with that data. Once the event loop is free and its turn comes, your callback is executed in the node.js application context with the data from behind the scenes.
With a database request it's the same story. It will prepare and send the query (possibly asynchronously again), and then call back once the database responds and the data is prepared for the application context.
To be honest, all you need with node.js is to understand the concept of events, not their implementation.
And the best way to do it - experiment.
1) Yes, you are right.
It works because everything you do with node is primarily I/O bound.
When a new request (event) comes in, it's put into a queue. At initialization time, Node allocates a thread pool that handles the I/O-bound work which can't be done with the operating system's non-blocking facilities (file system access, DNS lookups, etc.); network and socket calls, including database connections, use the OS's non-blocking mechanisms directly. Either way, this work does not block your JavaScript.
Now, your "callbacks" (or event handlers) are extremely fast because most of what you are doing is most likely CRUD and I/O operations, not CPU intensive.
Therefore, these callbacks give the feeling that they are being processed in parallel, but they are actually not: the actual parallel work and waiting happens behind the scenes (in the thread pool or the OS), while the callbacks per se just receive the results so that processing can continue and a response can be sent back to the client.
You can easily verify this: if your callbacks are heavy CPU tasks, you can be sure that you will not be able to process thousands of requests per second, and it scales really badly compared to a multi-threaded system.
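If you want to see that for yourself, here's a sketch of such a deliberately CPU-heavy handler - while the busy loop runs, the single JavaScript thread is occupied, so every other incoming request has to wait its turn:

const http = require("http");

http.createServer((req, res) => {
  // Burn ~1 second of CPU per request; nothing else can run meanwhile.
  const stop = Date.now() + 1000;
  while (Date.now() < stop) { /* busy */ }

  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("done\n");
}).listen(8124, "127.0.0.1");

// Ten concurrent requests against this server take roughly ten seconds in
// total, because the callbacks run strictly one after another.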
2) You are right, again.
Unfortunately, due to all these abstractions, you have to dive in to understand what's going on in the background. However, yes, there is a loop.
In particular, Node.js is implemented on top of libuv.
Interesting to read.
But I don't know how to identify the following code as whether event-driven or event loop. It does not repeat anything except the callback function fired after db query is finished.
Event-driven is a term you normally use when there is an event loop, and it means an app that is driven by events such as click-on-button, data-arrived, etc. Normally you associate a callback with such events.