Messages not making it into Elasticsearch - Logstash

We have a setup with three queues in RabbitMQ, handling three different types of logs.
The queues are consumed by Logstash; each message is given a tag, and Logstash then sends the message to the appropriate index in Elasticsearch.
So my input looks something like this:
input {
  rabbitmq {
    host => "localhost"
    queue => "central_access_logs"
    durable => true
    codec => json
    threads => 3
    prefetch_count => 50
    port => 5672
    tags => ["central_access_log"]
  }
}
And there is a similar setup for the other two queues.
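For illustration only (the question doesn't show them), the inputs for the other two queues would presumably look something like this; the queue names and tags below are placeholders, not the real ones:
input {
  rabbitmq {
    host => "localhost"
    queue => "central_error_logs"      # hypothetical queue name
    durable => true
    codec => json
    threads => 3
    prefetch_count => 50
    port => 5672
    tags => ["central_error_log"]      # hypothetical tag
  }
  rabbitmq {
    host => "localhost"
    queue => "central_app_logs"        # hypothetical queue name
    durable => true
    codec => json
    threads => 3
    prefetch_count => 50
    port => 5672
    tags => ["central_app_log"]        # hypothetical tag
  }
}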
My output is like this:
if("central_access_log" in [tags]){
elasticsearch {
host => "localhost"
index=> "central_access_logs"
}
}
I suspected for a while that not everything was making it into the central_access_log index (the other two indexes, more or less, seemed fine), so I added this:
file {
  path => '/data/out'
}
And let that run for a few weeks.
Recently, I noticed that for the last week and a half, nothing has been coming into that index (again, the other two are perfectly fine); however, the text file contains all the missing messages.
How can I go about debugging this? Is the error on Logstash's end, or Elasticsearch's?

Related

Elastic Stack: how should I send my custom Python logs?

So I have some custom logging in Python, coming in from multiple processes, that logs a certain crash with a format like
{timestamp},{PID},{Cause}
Now I want these events to be sent to Logstash and used in ELK so that I can later see some info on my Kibana dashboard. I've never used ELK before, so my question is: what would be the best approach?
- Use python-logstash and have two loggers at once?
- Simply send the data to Logstash (over HTTP, I think?) at the time it gets logged and use dissect later?
- Make a JSON document when the logger is logging the line and send that to Logstash?
If you want to send only the crash events, then having a separate logger is handy:
logger.addHandler(logstash.TCPLogstashHandler(host, 5959, version=1))
Then configure the Logstash pipeline for TCP input:
input {
  tcp {
    port => 5959
  }
}
filter {
  json {
    source => "message"
  }
}
output {
  elasticsearch {
    hosts => "elasticsearch:9200"
  }
}
For sending JSON-like data, populate a dict in Python and pass it to the logger via the extra parameter:
logger.log(level=logging.WARN, msg="process crashed", extra={'PID':pid, 'Cause': cause})
The resulting records will look like:
{
  "PID": 321,
  "Cause": "abcde",
  "timestamp": "2019-12-10T13:37:51.906Z",
  "message": "process crashed"
}
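Putting the Python side together, a minimal sketch might look like the following; it assumes the python-logstash package and a Logstash TCP input on port 5959, and the host name, logger name, and field names are only illustrative:
import logging
import logstash

LOGSTASH_HOST = "localhost"  # hypothetical host running the Logstash TCP input

logger = logging.getLogger("crash-logger")
logger.setLevel(logging.INFO)
logger.addHandler(logstash.TCPLogstashHandler(LOGSTASH_HOST, 5959, version=1))

def report_crash(pid, cause):
    # The extra dict becomes additional fields on the Logstash event.
    logger.warning("process crashed", extra={"PID": pid, "Cause": cause})

report_crash(321, "abcde")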

Mongo Change Streams running multiple times (kind of): Node app running multiple instances

My Node app uses Mongo change streams, and the app runs 3+ instances in production (more eventually, so this will become more of an issue as it grows). So, when a change comes in, the change stream functionality runs as many times as there are processes.
How to set things up so that the change stream only runs once?
Here's what I've got:
const options = { fullDocument: "updateLookup" };
const sitesFilter = [
  {
    $match: {
      $and: [
        { "updateDescription.updatedFields.sites": { $exists: true } },
        { operationType: "update" }
      ]
    }
  }
];
const sitesStream = Client.watch(sitesFilter, options);
// Start listening to site stream
sitesStream.on("change", async change => {
  console.log("in site change stream", change);
  console.log(
    "in site change stream, update desc",
    change.updateDescription
  );
  // Do work...
  console.log("site change stream done.");
  return;
});
It can easily be done with only MongoDB query operators. You can add a modulo query on the ID field, where the divisor is the number of your app instances (N). The remainder is then an element of {0, 1, 2, ..., N-1}. If your app instances are numbered in ascending order from zero to N-1, you can write the filter like this:
const filter = [
  {
    "$match": {
      "$and": [
        // Other filters
        { "_id": { "$mod": [<number of instances>, <this instance's id>] } }
      ]
    }
  }
];
Doing this with strong guarantees is difficult but not impossible. I wrote about the details of one solution here: https://www.alechenninger.com/2020/05/building-kafka-like-message-queue-with.html
The examples are in Java but the important part is the algorithm.
It comes down to a few techniques:
- Each process attempts to obtain a lock
- Each lock (or each change) has an associated fencing token
- Processing each change must be idempotent
- While processing the change, the token is used to ensure ordered, effectively-once updates
More details in the blog post.
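As a very rough sketch of the fencing-token part only (this is not the blog post's code): assume `token` is a monotonically increasing number handed out with the lock, `db` is a connected MongoDB Db handle, and processed state lives in a hypothetical siteProjections collection. The filter turns duplicates and out-of-order deliveries into no-ops:
async function applyChange(db, change, token) {
  // Ignore the write unless this token is newer than the last one recorded
  // on the document, so reprocessing and stale retries have no effect.
  await db.collection("siteProjections").updateOne(
    {
      _id: change.documentKey._id,
      $or: [{ lastToken: { $exists: false } }, { lastToken: { $lt: token } }]
    },
    { $set: { sites: change.fullDocument.sites, lastToken: token } }
  );
}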
It sounds like you need a way to partition updates between instances. Have you looked into Apache Kafka? Basically, what you would do is have a single application that writes the change data to a partitioned Kafka topic, and have your Node application be a Kafka consumer. This would ensure that only one application instance ever receives a given update.
Depending on your partitioning strategy, you could even ensure that updates for the same record always go to the same Node app (if your application needs to maintain its own state). Otherwise, you can spread the updates out in a round-robin fashion.
The biggest benefit of using Kafka is that you can add and remove instances without having to adjust configurations. For example, you could start one instance and it would handle all updates. Then, as soon as you start another instance, each starts handling half of the load. You can continue this pattern for as many instances as there are partitions (and you can configure the topic to have thousands of partitions if you want); that is the power of the Kafka consumer group. Scaling down works in reverse.
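As an illustration only (not part of the original answer), a consumer-group setup with the kafkajs client might look roughly like this; the broker address, topic name, and group id are all assumptions:
const { Kafka } = require("kafkajs");

const kafka = new Kafka({ clientId: "sites-worker", brokers: ["localhost:9092"] });
const consumer = kafka.consumer({ groupId: "site-change-handlers" });

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topics: ["site-changes"] });
  await consumer.run({
    // Each partition is owned by exactly one consumer in the group,
    // so a given change is handled by only one app instance.
    eachMessage: async ({ topic, partition, message }) => {
      const change = JSON.parse(message.value.toString());
      // Do work...
      console.log("handling change from partition", partition, change);
    }
  });
}

run().catch(console.error);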
While the Kafka option sounded interesting, it was a lot of infrastructure work on a platform I'm not familiar with, so I decided to go with something a little closer to home for me: sending an MQTT message to a little stand-alone app, and letting the MQTT server monitor messages for uniqueness.
siteStream.on("change", async change => {
  console.log("in site change stream");
  const mqttClient = mqtt.connect("mqtt://localhost:1883");
  const id = JSON.stringify(change._id._data);
  // You'll want to push more than just the change stream id, obviously...
  mqttClient.on("connect", function() {
    mqttClient.publish("myTopic", id);
    mqttClient.end();
  });
});
I'm still working out the final version of the MQTT server, but the method for evaluating the uniqueness of messages will probably store an array of change stream IDs in application memory (there is no need to persist them) and decide whether to proceed any further based on whether that change stream ID has been seen before.
var mqtt = require("mqtt");
var client = mqtt.connect("mqtt://localhost:1883");
var seen = [];

client.on("connect", function() {
  client.subscribe("myTopic");
});

client.on("message", function(topic, message) {
  var context = message.toString().replace(/"/g, "");
  if (seen.indexOf(context) < 0) {
    seen.push(context);
    // Do stuff
  }
});
This doesn't include security, etc., but you get the idea.
What about having a field in the DB called status, which is updated using findOneAndUpdate based on the event received from the change stream? Say you get two events at the same time from the change stream: the first event updates the status to "start", and the other will fail because the status is already "start". So the second event will not run any business logic.
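A rough sketch of that idea (the collection and field names are made up, `db` is a connected Db handle, and the result shape assumes MongoDB Node driver v4, which wraps the matched document as { value }):
async function claimAndProcess(db, change) {
  // Atomically claim the event; only the first instance to run this wins.
  const claimed = await db.collection("sites").findOneAndUpdate(
    { _id: change.documentKey._id, changeStatus: { $ne: "start" } },
    { $set: { changeStatus: "start" } }
  );
  if (!claimed.value) {
    return; // another instance already claimed this event
  }
  // Do work...
  await db.collection("sites").updateOne(
    { _id: change.documentKey._id },
    { $set: { changeStatus: "done" } }
  );
}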
I'm not claiming these are rock-solid, production-grade solutions, but I believe something like this could work.
Solution 1
Applying read-modify-write (a sketch follows below):
- Add a version field to the document; all newly created docs have version=0
- Receive the ChangeStream event
- Read the document that needs to be updated
- Perform the update on the model
- Increment version
- Update the document where both id and version match; otherwise discard the change
Yes, it creates 2 * n_application_replicas useless queries, so there is another option.
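Here is the promised sketch of Solution 1's read-modify-write loop, assuming a `version` field and an illustrative `sites` collection (the business logic itself is elided):
async function handleChange(db, change) {
  const sites = db.collection("sites");

  // Read the current document and remember the version we saw.
  const doc = await sites.findOne({ _id: change.documentKey._id });
  if (!doc) return;
  const seenVersion = doc.version;

  // Perform the update on the model (your business logic goes here)...
  const newFields = { /* derived from doc and change */ };

  // Write back only if nobody else has bumped the version in the meantime.
  const res = await sites.updateOne(
    { _id: doc._id, version: seenVersion },
    { $set: newFields, $inc: { version: 1 } }
  );
  if (res.modifiedCount === 0) {
    return; // another replica already applied this change; discard ours
  }
}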
Solution 2
- Create a collection of ResumeTokens in Mongo which stores a collection -> token mapping
- In the changeStream handler code, after a successful write, update the ResumeToken in that collection
- Create a feature toggle that will disable reading the ChangeStream in your application
- Configure only a single instance of your application to be the "reader"
- In case of "reader" failure, you can either enable reading on another node or redeploy the "reader" node
As a result, there can be any number of non-reader replicas and there won't be any useless queries.
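A sketch of that resume-token bookkeeping under those assumptions (the resumeTokens collection, the CHANGE_STREAM_READER toggle, and the collection names are all made up):
// Only the designated "reader" instance watches the stream; everyone else skips this.
async function startReader(db) {
  if (process.env.CHANGE_STREAM_READER !== "true") return; // hypothetical feature toggle

  const tokens = db.collection("resumeTokens");
  const saved = await tokens.findOne({ _id: "sites" });

  const stream = db.collection("sites").watch([], {
    fullDocument: "updateLookup",
    ...(saved ? { resumeAfter: saved.token } : {})
  });

  stream.on("change", async change => {
    // Do work...
    // After a successful write, persist the resume token so a replacement
    // reader can pick up where this one left off.
    await tokens.updateOne(
      { _id: "sites" },
      { $set: { token: change._id } },
      { upsert: true }
    );
  });
}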

Logstash always keeps one message in the pipeline

I am using Logstash to read and parse logs from a file and send them to a Rest based API. My shipper is working fine, but I am experiencing a strange behavior.
Version:
logstash-2.3.2
Problem:
When the Logstash shipper parses the first log entry, it does not send it; it keeps it in the pipeline. When it parses the second log entry, it sends the first log entry to the API. Hence one message always remains in the pipeline and is not sent to my API.
Whenever I stop my Logstash shipper process, it sends the last remaining message as well. So, in a sense, no message is lost, but the shipper is always one message behind.
Question:
Why is Logstash unable to flush its pipeline and send a message to the API as soon as it receives it?
You should paste your Logstash config and log format in order to get a precise answer; however, from what you have described, you seem to be using the multiline codec. From Logstash 2.2 onwards there is an auto_flush_interval setting for the multiline codec. It can be set to a number of seconds, and if the multiline codec does not see any new log line within that many seconds, it flushes the pending event in the pipeline to your API.
For an example and more information, see:
input {
  file {
    path => "$LogstashFilePathValue"
    type => "DemandwareError"
    tags => "$EnvironmentName"
    start_position => "beginning"
    sincedb_path => "NUL"
    codec => multiline {
      pattern => "\A\[%{TIMESTAMP_ISO8601:demandware_timestamp} GMT\]"
      negate => true
      what => previous
      auto_flush_interval => 10
    }
  }
}
The example is from the link: https://github.com/elastic/logstash/issues/1482
For more information on auto_flush_interval visit: https://www.elastic.co/guide/en/logstash/current/plugins-codecs-multiline.html#plugins-codecs-multiline-auto_flush_interval

logstash with date specific file names

I have an app that writes logs like
access_log-2014-09-08
access_log-2014-09-09
access_log-2014-09-10
It seems that if I have an input => file => path defined for access_log*, it only works on the files that are there when it starts up. When midnight rolls around and a new file is created, Logstash doesn't see it. Is there a way to specify a path that will catch this? Also, I don't need it tailing anything except the current day. It's not a huge problem if it looks at everything, but it would be cleaner and nicer not to do that.
Logstash config:
input {
  file {
    path => [ "/var/log/apache/access_log-*" ]
  }
}
... filters and output ...

Logstash multiple outputs don't work if one of the outputs fails

I have the following Logstash configuration:
output {
  elasticsearch {
    host => "elastichost"
  }
  stdout {
    codec => json
  }
  file {
    path => "./out.txt"
  }
}
In the case when the Elasticsearch host is unavailable, I do not receive any output at all; there are just errors about the Elasticsearch output failing.
So the question is: how can I configure Logstash to reliably send logs to the outputs even if one of them fails?
You can't do this in Logstash 1; any output thread that blocks will hang them all up.
The design of Logstash 2 is supposed to fix this.
