AWS EC2 instance Application logs - linux

I want to store application logs, such as uWSGI's ("/var/log/uwsgi/uwsgi.log"), on a device that multiple instances can access, with each instance saving its logs under a directory named after that instance.
Does AWS provide any solution for this?

There are a number of approaches you can take here. If you want an experience that is like writing directly to the filesystem, you could use something like s3fs to mount a common S3 bucket on each of your instances. This would give you a more or less real-time log merge, though honestly I would be concerned about the performance of such a setup in a high-volume application.
You could process the logs at some regular interval and push the data to a common store. This would not be real time, but would likely be a pretty simple solution. The problem here is that it may be difficult to interleave log entries from different servers if you need them arranged in time order.
Personally, I set up a Graylog server for each instance cluster I have, and send all my access logs, error logs, etc. to it. It is UDP based, so it is fire and forget from the application servers' standpoint. It provides nice search/querying tools as well. I like this approach because it removes log management from the application servers altogether.
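As a rough illustration of the "fire and forget" idea, here is a minimal Python sketch that sends one uncompressed GELF message to a Graylog UDP input. The server name `graylog.example.internal` and the extra field `source_file` are hypothetical, port 12201 is only the conventional GELF default, and large messages would need GELF chunking or compression, which this sketch does not attempt.

```python
import json
import socket
import time


def build_gelf_message(host, short_message, level=6, **extra):
    """Encode one log event as a GELF 1.1 JSON payload (bytes)."""
    payload = {
        "version": "1.1",
        "host": host,                    # name of the sending instance
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,                  # syslog severity, 6 = informational
    }
    # GELF expects custom fields to be prefixed with an underscore.
    for key, value in extra.items():
        payload["_" + key] = value
    return json.dumps(payload).encode("utf-8")


def send_gelf_udp(message, server="graylog.example.internal", port=12201):
    """Send the datagram and return immediately; no reply is awaited."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (server, port))
```

A caller would do something like `send_gelf_udp(build_gelf_message("web-1", "uwsgi worker respawned"))`. Because UDP gives no delivery guarantee, a dropped datagram is simply lost, which is the trade-off the answer accepts.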

Two options that I've used:
Use syslog (or syslog-ng) to log to a centralized location. We do this to ship our AWS log data offsite to our datacenter. syslog-ng is more reliable than plain old syslog and allows us to use MongoDB as a backing store.
Use logrotate to push your logs to S3. It's not real-time like the syslog solution, but it's a lot easier to set up and manage, especially if you have a lot of instances and aren't using a VPC.
Loggly and Splunk Storm are also two interesting SaaS products intended to solve this problem.
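For the centralized syslog option, an application written in Python could send its messages to the central host directly with the standard library's `SysLogHandler`, tagging each line with the instance name. This is only a sketch: the host `logs.example.internal`, the port, and the `instance_name`/`app` tag layout are assumptions, and the handler uses UDP by default.

```python
import logging
import logging.handlers


def make_central_logger(instance_name, app="uwsgi",
                        host="logs.example.internal", port=514):
    """Return a logger that forwards records to a remote syslog daemon."""
    logger = logging.getLogger(f"central.{instance_name}.{app}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # do not duplicate records on the root logger
    if not logger.handlers:   # avoid stacking handlers on repeated calls
        handler = logging.handlers.SysLogHandler(address=(host, port))
        # Prefix every message with the instance and application name so the
        # central server can split logs per instance.
        handler.setFormatter(
            logging.Formatter(f"{instance_name} {app}: %(message)s"))
        logger.addHandler(handler)
    return logger
```

On the receiving side, the syslog daemon can then route messages into per-instance directories based on that prefix.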

Related

How to create a log collection service using nodejs?

I need to build a log collection system.
I found that common log collection stacks include ELK and Hadoop/Hive.
1. As a front-end developer without a server-side background, can I complete a simple setup in a limited amount of time (for example, one week)?
2. Can I use Node.js, MongoDB and similar technology stacks to build a log system?
ELK already includes Logstash or Filebeat for exactly this purpose; you don't need Node to collect log files.
Hadoop (and Elasticsearch) use Log4j, which offers different appenders to write to external locations and/or log files on disk. With files on disk, you'd use Filebeat (or Fluentd) to process those files and send them to external locations, such as Apache Kafka topics, which can be read into Hadoop and/or Elasticsearch (or Splunk) for analytics. Mongo generally isn't well suited for log analysis or plaintext log searching.
Node services can also write their logs to disk using standard npm libraries; the tools listed above would then treat those files the same as logs from other services.

Shipping logs from network share using Filebeat on Windows

The problem statement: I have an application running on Windows. I want to ship log files from this application to ELK fronted by Kafka.
The challenge: This application writes a lot of process metadata to disk under a directory location. This information is important for the application's recovery and hence is stored on a network storage to support DR. The application also writes logs to the same directory location and we do not have the ability to separate the logs from the other process metadata. As a result logs are written to network share.
I want to ship the logs to Elastic. We typically use Beats to do this. However, the Filebeat documentation does not recommend shipping logs from network storage on Windows. Ref: https://www.elastic.co/guide/en/beats/filebeat/7.11/filebeat-network-volumes.html. I have also read various git issues and SO posts where people have complained about Filebeat stopping harvesting on rollover.
Since this is a network share, I was also not able to create a symlink or a Junction link to trick my application to write the logs to the hard disk.
Has anyone solved this issue?
P.S.: I also read somewhere that Logstash has better handling of files on a network share. However, I do not need Logstash and would like to avoid it if possible. Also, the official Logstash documentation mentions that reading files from NFS is only occasionally tested, not thoroughly tested.

Logstash vs Rsyslog for log file aggregation

I am working on a solution for centralized log file aggregation from our CentOS 6.x servers. After installing the Elasticsearch/Logstash/Kibana (ELK) stack I came across an rsyslog omelasticsearch plugin, which can send messages from rsyslog to Elasticsearch in Logstash format, and started asking myself why I need Logstash.
Logstash has a lot of different input plugins, including one that accepts rsyslog messages. Is there a reason why I would use Logstash for my use case, where I need to gather the content of log files from multiple servers? Also, is there a benefit to sending messages from rsyslog to Logstash instead of sending them directly to Elasticsearch?
I would use Logstash in the middle if there's something I need from it that rsyslog doesn't have. For example, getting GeoIP from an IP address.
If, on the other hand, I only needed to get syslog or file contents indexed in Elasticsearch, I'd use rsyslog directly. It can do buffering (disk+memory) and filtering, you can choose what the document will look like (you can put the textual severity instead of the number, for example), and it can parse unstructured data. But the main advantage is performance, which is what rsyslog focuses on. Here's a presentation with some numbers (and tips and tricks) on Logstash, rsyslog and Elasticsearch:
http://blog.sematext.com/2015/05/18/tuning-elasticsearch-indexing-pipeline-for-logs/
I would recommend Logstash. It is easier to set up, there are more examples, and the components are tested to fit together.
There are some other benefits too: in Logstash you can filter and modify your logs.
You can extend logs with useful data: server name, timestamp, ...
You can cast types, such as string to int (useful for getting a correct Elasticsearch index mapping).
You can filter out logs by some rules.
Moreover, you can set the batch size to optimize writes to Elasticsearch.
Another feature: if something goes wrong and there is a crazy amount of logs per second that Elasticsearch cannot process, you can configure Logstash to keep a queue of events or to drop events that cannot be saved.
If you go straight from the server to elasticsearch, you can get the basic documents in (assuming the source is json, etc). For me, the power of logstash is to add value to the logs by applying business logic to modify and extend the logs.
Here's one example: syslog provides a priority level (0-7). I don't want to have a pie chart where the values are 0-7, so I make a new field that contains the pretty names ("emerg", "debug", etc) that can be used for display.
Just one example...
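The severity example above is done in Logstash in the answer; the same enrichment logic can be sketched in plain Python to show what is being computed. The field names `severity` and `severity_label` are assumptions for illustration; the names themselves are the standard syslog severity keywords.

```python
# Standard syslog severity keywords, indexed by numeric severity 0-7.
SEVERITY_NAMES = ("emerg", "alert", "crit", "err",
                  "warning", "notice", "info", "debug")


def split_priority(pri):
    """Split a syslog PRI value into (facility, severity).

    PRI is encoded as facility * 8 + severity.
    """
    return divmod(pri, 8)


def add_severity_label(event):
    """Return a copy of the event with a human-readable severity field."""
    enriched = dict(event)  # leave the caller's dict untouched
    enriched["severity_label"] = SEVERITY_NAMES[event["severity"]]
    return enriched
```

A dashboard can then chart `severity_label` instead of the raw number, so a pie chart shows "err" and "debug" rather than 3 and 7.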
Neither is a viable option if you really want to rely on the system to operate under load and be highly available.
We found that the best option is to use rsyslog to send logs to a centralized location, buffer them in Redis or Kafka, and then use Logstash to do its magic and ship them to Elasticsearch.
Read our blog about it here - http://logz.io/blog/deploy-elk-production/
(Disclaimer - I am the VP product for logz.io and we offer ELK as a service)

Keeping Multiple Servers in a Cluster In-Sync?

I'm currently managing a cluster of PHP-FPM servers, all of which tend to get out of sync with each other. The application that I'm running on top of the app servers (Magento) allows admins to modify various files on the system, but now that the site is in a clustered setup, modifying a file only changes it on the one app server that handled the request, not on the other machines in the cluster.
Is there an open-source application for Linux that may allow me to keep all of these servers in sync? I have no problem with creating a small VM instance that can listen for changes from machines to sync. In theory, the perfect application would have small clients that run on each machine to be synced, which would talk to the master server which would then decide how/what to sync from each machine.
I have already examined the possibility of running a centralized file server, but unfortunately my app servers are spread out between EC2 and physical machines, which makes this unfeasible. As there are multiple app servers (some of which are dynamically created depending on the load of the site), simply setting up an rsync cron job is not efficient: the cron job would have to be modified on each machine to send files to every other machine in the cluster, and that would just be a whole bunch of unnecessary data transfers/ssh connections.
I'm dealing with setting up a similar solution, and I'm halfway there. I would recommend you use lsyncd, which monitors the disk for changes and then immediately (or at whatever interval you want) syncs files to a list of servers using rsync.
The only issue I'm having is keeping the server list up to date. Since I can spin up additional servers at any time, each machine in the cluster would need to be notified whenever a machine is added to or removed from the cluster.
I think lsyncd is a great solution that you should look into. The issue I'm having may turn out to be a problem for you as well, and that remains to be solved.
Instead of keeping tens or hundreds of servers cross-synchronized, it would be much more efficient, reliable, and most of all simpler to maintain just one "admin node" and replicate changes from it to all your "worker nodes".
For instance at our company we use a Development server -> Staging server -> Live backends workflow where all the changes are transferred across servers using a custom php+rsync front end. That allows the developers to push updates to a Staging server in the live environment, test out changes, and roll them to Live backends incrementally.
A similar approach could very well work in your case as well. Obviously it's not a plug-and-play solution, but I see it as the easiest way to go - both in terms of maintainability and scalability.

distributed logging: JMS and log4j?

Been doing some searching for a solution to this problem: I need log entries from apps running on several machines to be sent to & aggregated on a remote server. Requirements:
logging in the app needs to be asynchronous (can't wait for log entry to traverse network)
logging in the app needs to be queued; if the network fails, log entries need to be queued locally and sent to the centralized server when the network becomes available again
I'm looking at using log4j and a JMSAppender. Assuming that's a suitable solution, are there any examples available? What process would be running on the centralized server to receive log entries in this scenario?
Thanks.
One simple setup that comes to mind is to use Apache ActiveMQ.
It is an open source message broker (JMS compatible) that is able to cluster queues among several physical machines, and the ActiveMQ installation is rather lightweight. You simply install one ActiveMQ instance on each of your application machines. Then on the logging server (Physical Server C in the picture) you would have another ActiveMQ instance. Your application would use a JMS appender, and you could just use the included Apache Camel to read from the queue and write the log to a file or database, without needing to write an application for that task.
It could be as simple as adding something like the following to the camel.xml in the activemq /conf installation and importing the camel.xml in the activemq.xml configuration.
<route>
  <from uri="activemq:queue:LogQueue"/>
  <to uri="file:target/folder/?fileName=logfile.log&amp;fileExist=Append"/>
</route>
You could use a myriad of other frameworks, JMS servers and technologies, but I think this is a rather easy approach to implement, with very low cost and high stability.
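The "asynchronous" requirement from the question can be illustrated independently of JMS. The sketch below uses Python's standard `QueueHandler`/`QueueListener` pair rather than log4j: the application thread only puts records on an in-memory queue, and a background thread forwards them to the real (possibly slow, networked) handler. This is a swapped-in technique, not the ActiveMQ setup described above, and the in-memory queue is an assumption that does not survive a process restart or retry failed sends; durable local queuing is what the local broker provides in the answer.

```python
import logging
import logging.handlers
import queue


def start_async_logging(logger, target_handler):
    """Route a logger's records through a queue to target_handler.

    Returns the listener; call listener.stop() at shutdown to flush
    any records still waiting in the queue.
    """
    buffer = queue.Queue()  # unbounded, in-memory only
    # The application thread only enqueues; it never waits on the network.
    logger.addHandler(logging.handlers.QueueHandler(buffer))
    # A background thread drains the queue into the real handler.
    listener = logging.handlers.QueueListener(buffer, target_handler)
    listener.start()
    return listener
```

In use, `target_handler` would be whatever handler ships records to the central server, and the application would keep calling `logger.info(...)` as usual.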

Resources