How do I enable DEBUG logging just for slow queries? - cassandra

The logging level is set to INFO, and I want to enable logging of slow-running queries, for which slow_query_log_timeout_in_ms is set. If the logging level is set to DEBUG, queries running longer than slow_query_log_timeout_in_ms will be logged, but a lot of other debugging info will be logged as well.
I do not want any debug info other than the slow-running queries. Is it possible to enable logging of slow queries, and nothing else, at DEBUG level?
[cassandra#localhost ~]$ nodetool getlogginglevels
Logger Name Log Level
...
org.apache.cassandra INFO

Debug logging is enabled by default since Cassandra 2.2 and goes to debug.log. This is to reduce the "noise" that goes into system.log.
It is possible to disable debug logging by removing the appender in conf/logback.xml but it is not recommended. Debug logs are crucial if you are investigating an issue or need to understand what is going on with your system so it's recommended to always have it on.
If you want to go against best practice, disable debug logging and set the logging level for the class org.apache.cassandra.db.monitoring to DEBUG. Note that a level set dynamically with nodetool isn't persistent and you will need to set it again every time Cassandra is restarted. Cheers!
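As a sketch, the dynamic change is usually made with nodetool setlogginglevel, and a logger entry in conf/logback.xml is the usual way to make it survive restarts (verify the class name against your Cassandra version):
# set the level dynamically (reverts on restart)
nodetool setlogginglevel org.apache.cassandra.db.monitoring DEBUG
nodetool getlogginglevels
# or, to make it persistent, add a logger to conf/logback.xml:
<logger name="org.apache.cassandra.db.monitoring" level="DEBUG"/>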

Related

Linux journal daemon log and/or persist only the error/critical event logs

With the systemd journal daemon is there a way to configure the persisted logs (/var/log/journal/*) so that only Errors and Critical level events are actually stored to disk?
We have extremely high-volume servers that generate gigabytes daily in kernel logging alone, but as far as the journal goes we only really care about error/critical events.
I've checked the manpage for /etc/systemd/journald.conf but there doesn't seem to be much there in the way of selecting which levels of events persist (or which levels to ignore in general).
(https://www.freedesktop.org/software/systemd/man/journald.conf.html)
Use MaxLevelStore=crit
This controls the maximum log level of messages that are stored in the journal.
Messages equal or below the log level specified are stored, messages above are dropped. Defaults to "debug".
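As a sketch, the setting belongs under the [Journal] section of /etc/systemd/journald.conf, and the journal daemon needs a restart to pick it up:
# /etc/systemd/journald.conf
# persist only crit, alert and emerg messages to /var/log/journal
[Journal]
MaxLevelStore=crit
# apply the change
systemctl restart systemd-journald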

Spark event logs for long running process

I have a long-running Spark process and can see the generated event logs taking up a huge amount of space. When I investigated the logs, I could see that apart from the regular job, stage, and task event logs there are several entries like
{"Event":"SparkListenerUnpersistRDD","RDD ID":415}.
I have made sure no manual unpersist calls exist, but I can still see many entries like this being generated.
Is there a way to disable the unpersist RDD event logs?
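For context, these events are typically emitted by Spark's ContextCleaner, which automatically unpersists RDDs that go out of scope even when user code never calls unpersist(). A hedged sketch of the related settings (the jar name and log directory are placeholders, and behaviour may differ between Spark versions):
# spark.eventLog.enabled / spark.eventLog.dir control whether and where event logs are written.
# spark.cleaner.referenceTracking=false disables the ContextCleaner and hence the
# SparkListenerUnpersistRDD events it emits, but cached RDDs will no longer be cleaned up
# automatically -- useful only to confirm where the events come from, not as a fix.
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=hdfs:///spark-logs \
  --conf spark.cleaner.referenceTracking=false \
  your-app.jar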

application getting slower over a period on iis

We have an application hosted on IIS 8; it becomes slower over a period of time. When the app pool recycles after 29 hours it is back to normal performance, but we are not getting an OutOfMemory exception.
The application gets slower over a period of time and is back to normal performance after the default app pool recycle.
How slow is it? Don't rely on manual interpretation here. Check the IIS logs (default location: C:\inetpub\logs\LogFiles\W3SVC<website ID>) and look at the time-taken field. The time-taken value will be in milliseconds and the time field will be in UTC.
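As one way to quantify it, a sketch of a Microsoft Log Parser 2.2 query (assuming the tool is installed and that W3SVC1 is your site's log folder) that lists the slowest requests by time-taken:
LogParser.exe -i:IISW3C "SELECT TOP 25 date, time, cs-uri-stem, sc-status, time-taken FROM C:\inetpub\logs\LogFiles\W3SVC1\u_ex*.log ORDER BY time-taken DESC"
Comparing the output from just after an app pool recycle with the output 20+ hours later should show whether specific URLs degrade or everything slows down uniformly.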
Based on the IIS logs, set up a FREB rule as per - https://blogs.msdn.microsoft.com/docast/2016/04/28/troubleshooting-iis-request-performance-slowness-issues-using-freb-tracing/
In order to analyze the FREB log, open it in IE and click on compact view. Observe the time field on the right-hand side. Once you observe a definite jump in the time field, that's your culprit. If you do not observe any changes in the time field, then the last module is the possible culprit.
You can also capture manual hang dumps of the w3wp process using Debug Diag during the time of the issue. Install Debug Diag from - https://www.microsoft.com/en-us/download/details.aspx?id=49924
Capture a manual hang dump during the time of the issue. Search specifically for "Create a user dump file for a process" on this page and follow the steps mentioned under it - https://support.microsoft.com/en-us/help/919792/how-to-use-the-debug-diagnostics-tool-to-troubleshoot-a-process-that-h
Once you have the required dump file, run the Debug Diag analysis engine on it and look at the results. Let me know if you need help with the dump analysis.

Will logstash parsing give better performance with offline logs or online?

I have the ELK stack installed and am about to do performance testing.
I have a doubt below which I am not able to resolve myself; expert suggestions/opinions would be helpful.
I am unsure whether to:
1. Run logstash LIVE - meaning, install logstash and run ELK in parallel with the performance testing of my application.
2. Or first do the performance testing, collect the logs, and feed them to logstash offline. (This option is very much possible, as I am running this test for only about 30 minutes.)
Which will be more performant?
My application is in Java, and since logstash also uses a JVM for its parsing, I am afraid it will have an impact on my application's performance.
Considering this, I prefer to go with option 2, but I would like to know whether there are any benefits/advantages to option 1 that I am missing.
Help/suggestions much appreciated.
Test your real environment under real conditions to get anything meaningful.
Will you run logstash on the server? Or will you feed your logs in the background to e.g. Kafka, as described in the blog post you summoned me from? Or will you run a batch job and then collect the logs after the fact?
Of course, doing anything on the server itself during processing will have an impact, and tuning your JVM will also have a big influence on how well everything performs. In general it is not an issue to run multiple JVMs on the same server.
Do your tests once with logstash / kafka / flume or any other log processing or shipping tool you want to use enabled and then run a second pass without these tools to get an idea of how much they impact the performance.
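For the offline approach (option 2), a minimal sketch of a logstash pipeline that replays log files collected after the test run; the file path and Elasticsearch host are placeholders to adapt:
input {
  file {
    path => "/var/log/perftest/app-*.log"   # logs collected after the test run (placeholder path)
    start_position => "beginning"           # read files from the start instead of tailing
    sincedb_path => "/dev/null"             # forget read positions so the replay is repeatable
  }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}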

Logstash reaches 99% CPU usage and freezes forever (or until restarted)

I'm currently running an ELK cluster on reasonably weak hardware (four virtual machines, with 4 GB memory assigned and two cores each). This is slated to change in a couple of months, but for now we still need to ingest and make logs available.
After getting all of the servers of one service sending their logs to Logstash via nxlog, collection worked fairly well for a few days.
Shortly after that, logstash frequently started to wedge. The logstash thread 'filterworker.0' will jump to 93% and then 99% of the server's CPU. Logstash itself won't terminate; instead it will continue on, hung, never sending any fresh logs to Elasticsearch. Debug logs will show that logstash is continually calling flush by interval. It will never recover from this state; it ran an entire weekend hung and only resumed normal operations when I restarted it. Logstash would start catching up on the weekend's logs and then quickly freeze again (usually within five to ten minutes), requiring another restart of the service. Once the logs had mostly been able to catch up (many restarts later and some turning off of complicated grok filters), logstash returned to its previous habit of wedging every five to thirty minutes.
I attempted to narrow this down to a particular configuration and swapped my log filters into and out of the conf.d directory. With fewer configs, logstash would run for longer periods of time (up to an hour and a half), but eventually it would freeze again.
Connecting jstack to the PID of the frozen filterworker.0 thread returned mostly 'get_thread_regs failed for a lwp' debugger exceptions and no deadlocks found.
There are no actual failures in logstash's logs when run at debug verbosity; just those buffering log lines.
The disks are not full.
Our current configuration is three elasticsearch nodes, all receiving input from the logstash server (using logstash's internal load balancer). We have a single logstash server. These are all CentOS 7 machines. The logstash machine is running version 2.1.3, sourced from Elastic's yum repository.
I've played around with changing the heap size, but nothing appears to help, so I'm currently running it at the out-of-the-box defaults. We only use one worker thread as it's a single-core virtual machine. We used to use multiline, but that was the first thing I commented out when this started to happen.
I'm not sure where to go next. My theory is that logstash's buffer is just unable to handle the current log traffic, but without any conclusive errors in the logs, I'm not sure how to prove it. I feel like it might be worth putting a redis or rabbit queue between nxlog and logstash to buffer the flood; does that seem like a reasonable next step?
Any suggestions that people might have would be greatly appreciated.
You may try to reset the Java environment. When I start up my logstash, it goes up to 99% CPU usage, but when the JVM starts over, the CPU usage drops to 3%. So I guess maybe your Java environment has something wrong with it.
Hope this helps.
I use monit to monitor the service, check for high CPU usage, and then restart Logstash according to the findings. It's a bit of a workaround, not really a long-term solution.
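As a sketch of that workaround (the pid file path and the thresholds are assumptions to adapt to your install):
# /etc/monit.d/logstash
check process logstash with pidfile /var/run/logstash.pid
  start program = "/bin/systemctl start logstash"
  stop program  = "/bin/systemctl stop logstash"
  if cpu > 90% for 5 cycles then restart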
A queuing system would probably do the trick; check out Kafka, Redis, or RabbitMQ. You would need to measure the difference between the rate at which the queue is written to and the rate at which it is read from.
It sounds like you need more Logstash nodes. We experienced similar outages, caused by CPU, when the log throughput went up for various reasons. We are putting through approx. 6K lines per second and have 6 nodes (just for reference).
Also, putting a Redis pipeline in front of the Logstash nodes allowed us to configure our Logstash nodes to pull and process accordingly. Redis has allowed our Logstash nodes to now be over provisioned as they don't bear the brunt of the traffic. They pull log entries and their utilization is more consistent (no more crashing).
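A hedged sketch of that Redis buffer using logstash's standard redis input/output plugins (the host, key name, listening port, and the assumption that nxlog ships JSON lines over TCP are all placeholders): a lightweight shipper pushes events onto a Redis list, and the indexer pulls from it at its own pace.
# shipper pipeline: accept logs from nxlog and push them onto a Redis list
input  { tcp { port => 5140 codec => json_lines } }
output { redis { host => "redis.internal" data_type => "list" key => "logstash" } }
# indexer pipeline: pull from the Redis list and do the heavy filtering and output
input  { redis { host => "redis.internal" data_type => "list" key => "logstash" } }
output { elasticsearch { hosts => ["es1:9200","es2:9200","es3:9200"] } }
The queue also makes it easy to compare the rate at which it is written to against the rate at which it is read from, as suggested above.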
