In Node.js, having set up winston and Logstash, I observe in the Logstash UI (Kibana) that several logging messages are often tucked into one row as if they were a single message. Any quick guess as to which component is causing this and how it can be avoided?
Although message grouping could be nice in general, the messages are collapsed quite arbitrarily, and it is detrimental: the structure of such a grouped message differs from that of a regular message, which really doesn't help when mining the data.
I don't mind if the transport sends chunks to save on communication overhead, but I would very much like each message emitted by my code to winston to remain a single message and not get grouped with others.
(I am currently using winston-logstash to funnel messages from winston to Logstash.)
It seems the Logstash split filter avoids this. (Its documentation is poor, by the way.)
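For reference, a minimal filter along these lines seems to do it (an untested sketch; it assumes the grouped messages arrive newline-delimited, and "\n" is the split filter's default terminator anyway):
filter {
  split {
    # break one combined event into one event per line
    terminator => "\n"
  }
}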
I am building my first web-based Node.js application - an online game - as a hobby project to teach myself how it all works.
I'm using socket.io to send real-time updates (who's in the lobby, points scored, etc.) to users, but I'm not sure whether I'm managing the sockets, and the information being sent through them, in the best way.
Whenever the game is updated, I'm sending an object to each user which updates everything at once, and a lot of the time the information being updated is actually staying the same. For example, if a user scores a point, an update is sent to everyone's browser to update the leaderboard, but that same update re-sends information, such as usernames, which stays the same throughout the game:
exampleObject = {
  "usernames": ["username1", "username2"], // only gets updated in the browser once, but is sent every time
  "points": {
    "username1": 1, // different value with every update
    "username2": 3
  }
}
(The real object is quite a bit bigger than this)
Would it be more sensible to have a different socket.on function for every individual piece of information which needs updating, so I can then call them individually as and when required, or is there any sense in updating everything through one function? Any thoughts/advice would be greatly appreciated.
If you are sending the same piece of information over and over, it makes sense to design a specific message that contains only what actually needs to be sent, so you aren't repeatedly transmitting data that hasn't changed. You can have as many different messages as you want, and you should use that to design efficient messages, particularly for the most common ones.
Would it be more sensible to have a different socket.on function for every individual piece of information which needs updating, so I can then call them individually as and when required
Yes. Design efficient messages specifically for things you regularly send.
or is there any sense in updating everything through one function?
Only if you need to change lots of stuff at once. It's wasteful to include data in a frequent message that never changes and doesn't need to be sent.
It's perfectly fine to have different messages you send for different purposes and then the client has different listeners for those specific messages. At the same time, if you regularly send three pieces of data together, you probably wouldn't make a separate message for each piece of data - you'd put those three together such that your message structure aligns with your usage.
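For example, a hypothetical sketch of that split (the event names and the client-side render functions are made up for illustration):
// server: send the full roster once, when a player connects
io.on('connection', (socket) => {
  socket.emit('roster', { usernames: ['username1', 'username2'] });
});

// server: when a point is scored, broadcast only the delta
io.emit('scoreUpdate', { username: 'username1', points: 1 });

// client: a separate listener per message type
socket.on('roster', (data) => renderRoster(data.usernames));
socket.on('scoreUpdate', (data) => updateLeaderboard(data.username, data.points));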
And, you can also have different messages for different purposes even if some data is in both messages.
One more note here. The title of your question "How should I manage the number of sockets in a node.js application?" seems to ask about managing the number of sockets. But, the rest of your question isn't about that at all. The rest of your question is about having different messages on the same socket. You don't need a new socket in order to define and use a different message. You can have thousands of different messages that you use all on the same socket connection. That's the whole architecture of socket.io. You send a message name and some data that goes with it. You can use a limitless number of separate message names all on the same connection.
Our current approach is to:
Send all events to Splunk (through Splunk's own log4j-appender).
Define Splunk alerts, which trigger Moogsoft.
Obviously, this increases latency and relies on Splunk more than necessary, which makes me wonder whether someone has already developed a Moogsoft appender for log4j.
A simple search hasn't brought anything up -- hence this question.
I haven't done this, but log4j has a SocketAppender:
https://howtodoinjava.com/log4j/log4j-socketappender-and-socket-server-example/
That might fit with Moogsoft's Socket LAM:
https://docs.moogsoft.com/en/configure-the-socket-lam.html
Alternatively:
https://github.com/logstash/log4j-jsonevent-layout
gives a JSON layout to log4j, which could then be received with a REST LAM.
I don't know of anyone that has put together an actual appender, but I don't think you'd need one. An HTTP appender with a JSON layout sending to a Moogsoft REST adapter should be able to do the job, and might be a lot easier to set up than handling raw bytes off a socket.
I haven't done it so I'm not sure how much work it would be to set up. I suspect there's some work involved on either the log4j side to get the layout to look like Moogsoft wants it, or on the Moogsoft side to normalize what it gets sent.
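If you're on Log4j 2 (or can move to it), a sketch of that idea might look like the following; the endpoint URL is hypothetical, and the JsonLayout output would still need massaging to match what the REST LAM expects:
<Configuration>
  <Appenders>
    <!-- POSTs each event as JSON to a (hypothetical) Moogsoft REST LAM endpoint -->
    <Http name="Moogsoft" url="http://moogsoft.example.com:8888/events">
      <JsonLayout compact="true" eventEol="true"/>
    </Http>
  </Appenders>
  <Loggers>
    <Root level="warn">
      <AppenderRef ref="Moogsoft"/>
    </Root>
  </Loggers>
</Configuration>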
I'm working on what's basically a highly available distributed message-passing system. The system receives messages from someplace over HTTP or TCP, performs various transformations on them, and then sends them to one or more destinations (also over TCP/HTTP).
The system has a requirement that all messages sent to a given destination are in-order, because some messages build on the content of previous ones. This limits us to processing the messages sequentially, which takes about 750ms per message. So if someone sends us, for example, one message every 250ms, we're forced to queue the messages behind each other. This eventually introduces intolerable delay in message processing under high load, as each message may have to wait for hundreds of other messages to be processed before it gets its turn.
In order to solve this problem, I want to be able to parallelize our message processing without breaking the requirement that we send them in-order.
We can easily scale our processing horizontally. The missing piece is a way to ensure that, even if messages are processed out-of-order, they are "resequenced" and sent to the destinations in the order in which they were received. I'm trying to find the best way to achieve that.
Apache Camel has a thing called a Resequencer that does this, and it includes a nice diagram (which I don't have enough rep to embed directly). This is exactly what I want: something that takes out-of-order messages and puts them in-order.
But, I don't want it to be written in Java, and I need the solution to be highly available (i.e. resistant to typical system failures like crashes or system restarts) which I don't think Apache Camel offers.
Our application is written in Node.js, with Redis and Postgresql for data persistence. We use the Kue library for our message queues. Although Kue offers priority queueing, the featureset is too limited for the use-case described above, so I think we need an alternative technology to work in tandem with Kue to resequence our messages.
I was trying to research this topic online, and I can't find as much information as I expected. It seems like the type of distributed architecture pattern that would have articles and implementations galore, but I don't see that many. Searching for things like "message resequencing", "out-of-order processing", "parallelizing message processing", etc. turns up solutions that mostly just relax the in-order requirement based on partitions or topics or whatnot, or that only discuss parallelization on a single machine. I need a solution that:
Can handle processing on multiple messages simultaneously in any order.
Will always send messages in the order in which they arrived in the system, no matter what order they were processed in.
Is usable from Node.js.
Can operate in an HA environment (i.e. multiple instances of it running on the same message queue at once without inconsistencies).
Our current plan, which makes sense to me but which I cannot find described anywhere online, is to use Redis to maintain sets of in-progress and ready-to-send messages, sorted by their arrival time. Roughly, it works like this:
When a message is received, that message is put on the in-progress set.
When message processing is finished, that message is put on the ready-to-send set.
Whenever the same message is at the front of both the in-progress and ready-to-send sets, that message can be sent, and it will be in order.
I would write a small Node library that implements this behavior with a priority-queue-esque API using atomic Redis transactions. But this is just something I came up with myself, so I am wondering: Are there other technologies (ideally using the Node/Redis stack we're already on) that are out there for solving the problem of resequencing out-of-order messages? Or is there some other term for this problem that I can use as a keyword for research? Thanks for your help!
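To make that concrete, here is a rough, untested sketch of the kind of library I have in mind (the key names and node_redis usage are illustrative; a multi-instance deployment would also need a Lua script or WATCH to make the head check-and-pop atomic):
const redis = require('redis');
const client = redis.createClient();

// on arrival: record the message id, scored by its arrival timestamp
function markInProgress(msgId, arrivalTime) {
  client.zadd('in-progress', arrivalTime, msgId);
}

// when a worker finishes processing (possibly out of order):
function markReady(msgId, arrivalTime) {
  client.zadd('ready-to-send', arrivalTime, msgId);
}

// the oldest in-progress message may be sent only once it is also
// the oldest ready-to-send message
function drain(send) {
  client.zrange('in-progress', 0, 0, (err, head) => {
    if (err || head.length === 0) return;
    client.zrange('ready-to-send', 0, 0, (err2, ready) => {
      if (err2 || ready.length === 0 || head[0] !== ready[0]) return;
      client.multi()
        .zrem('in-progress', head[0])
        .zrem('ready-to-send', head[0])
        .exec((err3) => {
          if (err3) return;
          send(head[0]);
          drain(send); // keep draining while the heads keep matching
        });
    });
  });
}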
This is a common problem, so there are surely many solutions available. This is also quite a simple problem, and a good learning opportunity in the field of distributed systems. I would suggest writing your own.
You're going to have a few problems building this, namely
1: Guaranteed order of messages
2: Exactly-once delivery
You've found number 1, and you're solving it by resequencing messages in Redis, which is an OK solution. The second problem, however, is not solved.
It looks like your architecture is not geared towards fault tolerance, so currently, if a server crashes, you restart it and continue with your life. This works fine when processing all requests sequentially, because then you know exactly where you crashed, based on what the last successfully completed request was.
What you need is either a strategy for finding out what requests you actually completed, and which ones failed, or a well-written apology letter to send to your customers when something crashes.
If Redis is not sharded, it is strongly consistent. It will fail and possibly lose all data if that single node crashes, but you will not have any problems with out-of-order data, or data popping in and out of existence. A single Redis node can thus uphold the guarantee that if a message is inserted into the to-process set, and then into the done set, no node will see the message in the done set without it also being in the to-process set.
How I would do it
Using Redis seems like too much fuss, assuming that the messages are not huge, that losing them is OK if a process crashes, and that running them more than once, or even running multiple copies of a single request at the same time, is not a problem.
I would recommend setting up a supervisor server that takes incoming requests, dispatches each to a randomly chosen slave, stores the responses, and puts them back in order before sending them on. You said you expect processing to take 750ms. If a slave hasn't responded within, say, 2 seconds, dispatch the request again to another randomly chosen node after a random 0-1 second delay. The first one to respond is the one you use. Beware of duplicate responses.
If the retry also fails, double the maximum wait time. After 5 failures or so, each waiting up to twice as long as the previous one (or any multiple greater than one), you probably have a permanent error and should ask for human intervention. This algorithm is called exponential backoff, and it prevents a sudden spike in requests from taking down the entire cluster. Without the random interval, retrying after exactly n seconds would likely cause a self-inflicted DoS attack every n seconds until the cluster dies, if it ever gets a big enough load spike.
There are many ways this could fail, so make sure this system is not the only place data is stored. However, this will probably work 99+% of the time, it's probably at least as good as your current system, and you can implement it in a few hundred lines of code. Just make sure your supervisor is using asynchronous requests so that you can handle retries and timeouts. JavaScript is single-threaded by nature, so this is slightly trickier than normal, but I'm confident you can do it.
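A rough sketch of that dispatch-and-retry loop (sendToSlave and the slaves list are assumed helpers; for brevity this version retries sequentially instead of racing duplicate requests as described above):
const TIMEOUT_MS = 2000;

// resolves with the request's result, or rejects after `ms` milliseconds
function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise((resolve, reject) =>
      setTimeout(() => reject(new Error('timeout')), ms)),
  ]);
}

async function dispatch(msg, slaves) {
  for (let attempt = 0; attempt < 5; attempt++) {
    const slave = slaves[Math.floor(Math.random() * slaves.length)];
    try {
      // sendToSlave is an assumed async helper that performs the request
      return await withTimeout(sendToSlave(slave, msg), TIMEOUT_MS);
    } catch (err) {
      // exponential backoff with jitter: wait a random 0-1s, 0-2s, 0-4s, ...
      const wait = Math.random() * 1000 * Math.pow(2, attempt);
      await new Promise((resolve) => setTimeout(resolve, wait));
    }
  }
  throw new Error('5 attempts failed; time for human intervention');
}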
I have a console/desktop application that crawls a lot of data (think a million calls) from various web services. At any given time I have about 10 threads performing these calls and aggregating the data into a MySQL database. All seeds are also stored in a database.
What would be the best way to report its progress? By progress I mean:
How many calls already executed
How many failed
What's the average call duration
How much is left
I thought about logging all of them somehow and tailing the log to get the data. Another idea was to offer some kind of output to an always-open TCP endpoint, where some form of UI could read the data and display an aggregation. Both ways look too rough and too complicated.
Any other ideas?
The "best way" depends on your requirements. If you use a logging framework like NLog, you can plug in a variety of logging targets like files, databases, the console or TCP endpoints.
You can also use a viewer like Harvester as a logging target.
When logging multi-threaded applications I sometimes have an additional thread that writes a summary of progress to the logger once every so often (e.g. every 15 seconds).
Since it is a console application, just use Console.WriteLine and have the application print the important stuff to the console.
I did something similar in an application I created to export PDFs from a SQL Server database back into PDF format.
You can do it many different ways. If you are counting records and their sizes, you can run a tally of sorts and show the total every so many records.
I also wrote out to a text file, so that I could keep track of all the PDFs and which case numbers they belonged to, and things like that. That information is in the answer I gave to the question linked above.
You could also write the statistics out to a text file every so often.
The logger that Eric J. mentions is probably going to be a little easier to implement and would be a nice tool for your toolbox.
All of these options are valid, depending on your specific needs.
In my nlog configuration, I've set
<targets async="true">
with the understanding that all logging now happens asynchronously to my application workflow. (and I have noticed a performance improvement, especially on the Email target).
This has me thinking about log sequence though. I understand that with async, one has no guarantee of the order in which the OS will execute the async work. So if, in my web app, multiple requests come in to the same method, each logging their occurrence to NLog, does this really mean that the sequence in which the events appear in my log target will not necessarily be the sequence in which the log method was called by the various requests?
If so, is this just a consequence of async that one has to live with? Or is there something I can do to have my logs reflect the correct sequence?
Unfortunately this is something you have to live with. If it is important to maintain the sequence you'll have to run it synchronously.
But if it is possible for you to manually maintain a sequence number in the log message, it could be a solution.
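For example, NLog has a ${sequenceid} layout renderer that stamps each event with an id assigned when the log call is made, so entries can be sorted back into call order afterwards. A minimal sketch (the file target and layout are just illustrative):
<target name="file" xsi:type="File" fileName="app.log"
        layout="${sequenceid}|${longdate}|${level}|${message}" />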
I know this is old and I'm just ramping up on NLog, but if the performance improvement you see is mainly for the email target, you may want to enable async for just that target.
Activating <targets async="true"> does not make NLog reorder the LogEvent sequence. It just activates an internal queue that provides better handling of bursts and enables batch writing.
If a single thread writes 1000 LogEvents, they will NOT become out of order because of the async handling.
If 10 threads are each writing 1000 LogEvents, their logging will mix together, but the LogEvents of an individual thread will be in the CORRECT order.
But be aware that <targets async="true"> uses overflowAction=Discard by default. See also: https://github.com/nlog/NLog/wiki/AsyncWrapper-target#async-attribute-will-discard-by-default
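If losing events on overflow is a concern, an explicit AsyncWrapper target lets you change that default (a sketch; the queue limit is arbitrary):
<targets>
  <!-- explicit wrapper instead of async="true", so overflowAction can be configured -->
  <target name="asyncFile" xsi:type="AsyncWrapper" overflowAction="Grow" queueLimit="10000">
    <target name="file" xsi:type="File" fileName="app.log" />
  </target>
</targets>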
For more details about performance, see: https://github.com/NLog/NLog/wiki/performance