Structured Logging with Apache Spark on Google Kubernetes Engine

I am running Apache Spark applications on a Google Kubernetes Engine cluster, which propagates any output from STDOUT and STDERR to Cloud Logging. However, granular log severity levels are not propagated: every message ends up with either INFO or ERROR severity in Cloud Logging (depending on whether it was written to stdout or stderr), and the actual severity level is only visible inside the text payload.
My goal is to format the messages in the Structured Logging JSON format so that the severity level is propagated to Cloud Logging. Unfortunately, Apache Spark still uses the deprecated log4j 1.x library for logging, and I would like to know how to format log messages in a way that Cloud Logging can pick them up correctly.
So far, I am using the following default log4j.properties file:
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

When Cloud Logging is enabled in a GKE cluster, logging is managed by GKE, so it is not possible to change the format of the logs as easily as it is on a GCE instance.
To push JSON-formatted logs from GKE, you can try the following options:
Make your application emit logs in JSON format; Cloud Logging detects JSON-formatted log entries and ingests them as structured payloads (see the sketch after this list).
Manage your own fluentd deployment, as suggested here, and set up your own parser; however, that solution is then managed by you and no longer by GKE.
Add a sidecar container that reads your logs, converts them to JSON, and writes the JSON to stdout. The GKE logging agent will then ingest the sidecar's logs as JSON.
Bear in mind that option three comes with some caveats: it can lead to significant resource consumption, and you won't be able to use kubectl logs, as explained here.
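For the first option, here is a minimal sketch of what a custom log4j 1.x layout could look like. The package, class name, and field set are illustrative assumptions, not an official API; the idea is simply that the Cloud Logging agent recognizes a top-level severity field in JSON log lines, so the layout emits one JSON object per line containing that key.

package com.example.logging; // hypothetical package for this sketch

import org.apache.log4j.Layout;
import org.apache.log4j.spi.LoggingEvent;

// Emits one JSON object per line with a "severity" key that the
// Cloud Logging agent can map to the LogEntry severity.
public class CloudLoggingJsonLayout extends Layout {

    @Override
    public String format(LoggingEvent event) {
        // Map log4j's WARN to Cloud Logging's WARNING; the other level names match.
        String level = event.getLevel().toString();
        String severity = "WARN".equals(level) ? "WARNING" : level;
        StringBuilder sb = new StringBuilder(256);
        sb.append('{')
          .append("\"severity\":\"").append(severity).append("\",")
          .append("\"logger\":\"").append(escape(event.getLoggerName())).append("\",")
          .append("\"message\":\"").append(escape(event.getRenderedMessage())).append("\"")
          .append('}')
          .append(LINE_SEP);
        return sb.toString();
    }

    // Minimal JSON string escaping; a real implementation should use a JSON library.
    private static String escape(String s) {
        if (s == null) {
            return "";
        }
        return s.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
    }

    @Override
    public boolean ignoresThrowable() {
        // Throwables are not serialized in this sketch; append
        // event.getThrowableStrRep() to the message if you need stack traces.
        return true;
    }

    @Override
    public void activateOptions() {
        // No options to configure in this sketch.
    }
}

The layout would then replace the PatternLayout in log4j.properties, e.g. log4j.appender.console.layout=com.example.logging.CloudLoggingJsonLayout (dropping the ConversionPattern line), assuming the compiled class is on the driver and executor classpath, for example via --jars or a custom container image.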

Related

Application Insights Log4Net Appender - telemetry sampling and reliability

We currently use log4net with DB and File appenders to capture logging. In production we mostly run with WARN and higher levels only, but on occasion INFO-level logging is turned on to troubleshoot end-user issues.
These appenders are reliable, and we have never lost logs in a way that prevented our support team from troubleshooting issues.
We are setting up Azure hosting for our application, and Application Insights is adding a lot of value with its standard telemetry collection. We are now evaluating redirecting log4net logs into Application Insights and gradually making AI the log aggregator.
Microsoft has released a log4net AI appender: https://www.nuget.org/packages/Microsoft.ApplicationInsights.Log4NetAppender
I could not find enough information on how sampling affects these logs. For example, if some INFO logs are lost to sampling, that is not ideal but we could live with it. If WARN, ERROR, or FATAL logs are lost, Application Insights becomes unreliable for our support team, and they would fall back to the traditional logs in the DB or files.
Sampling on automatic telemetry (like requests, dependencies) is ok and understood, but is there a mechanism to ensure log4net logs are reliably made available in Application Insights?
According to this issue, the Log4Net appender should be sampled in the same way as other telemetry is sampled.
So you can simply turn off sampling (the default is adaptive sampling) in code.
For example, if it is an ASP.NET Core application, you can use the code below, as per this doc:
public void ConfigureServices(IServiceCollection services)
{
    // ...
    // Disable adaptive sampling so log4net traces are not dropped.
    var aiOptions = new Microsoft.ApplicationInsights.AspNetCore.Extensions.ApplicationInsightsServiceOptions();
    aiOptions.EnableAdaptiveSampling = false;
    services.AddApplicationInsightsTelemetry(aiOptions);
    // ...
}

AKS log format changed

We recently updated our AKS cluster from 1.17.x to 1.19.x and noticed that the format of our custom application logs in /var/lib/docker/containers changed.
Before the update it looked like this:
Afterwards it looks like this:
I found some notes in the changelog that Kubernetes moved from plain-text logs to structured logs (for system components), but I don't see how that relates to the change in our log format.
https://kubernetes.io/blog/2020/09/04/kubernetes-1-19-introducing-structured-logs/#:~:text=In%20Kubernetes%201.19%2C%20we%20are,migrated%20to%20the%20structured%20format
https://kubernetes.io/docs/concepts/cluster-administration/system-logs/
Is there a way to still get valid JSON logs in /var/lib/docker/containers on AKS > 1.19.x?
Background:
We send our application logs to Splunk and don't use the Azure stack for log analysis. Our Splunk setup cannot parse that new log format as of now.
The format of the logs is defined by the container runtime. It seems you were previously parsing logs from the Docker container runtime, and now it is containerd (https://azure.microsoft.com/en-us/updates/azure-kubernetes-service-aks-support-for-containerd-runtime-is-in-preview/).
Based on the article, you can still choose Moby (which is Docker) as the container runtime.
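For illustration only (these are not your actual log lines): with the Docker json-file driver, each line in the log file is a JSON object roughly like
{"log":"2021-03-01 10:00:00 INFO MyApp: hello\n","stream":"stdout","time":"2021-03-01T10:00:00.123456789Z"}
whereas the containerd (CRI) format is a plain-text line of the form
2021-03-01T10:00:00.123456789Z stdout F 2021-03-01 10:00:00 INFO MyApp: hello
(timestamp, stream, a partial/full flag, then the raw message), which is why a Splunk configuration that expects JSON stops parsing after the runtime switch.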
To take that off your shoulders entirely, you could look into one of the following tools, which automatically detect the container runtime and log format for you:
Splunk Connect for Kubernetes https://github.com/splunk/splunk-connect-for-kubernetes
Collectord https://www.outcoldsolutions.com

Options for getting logs in kubernetes pods

I have some developer logs in Kubernetes pods. What is the best method to make these logs available for the developers to see?
Any specific tools that we can use?
Graylog is an option, but I am not sure whether it can be customized to ingest the developer logs.
The most basic method would be to simply use the kubectl logs command:
Print the logs for a container in a pod or specified resource. If the
pod has only one container, the container name is optional.
Here you can find more details about the command and its flags, alongside some useful examples.
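For example (the pod and container names are placeholders):
kubectl logs <pod-name> -c <container-name> --tail=100 -f
Here -c selects a container in a multi-container pod, --tail limits output to the last 100 lines, and -f follows the stream; kubectl logs --previous <pod-name> shows the logs of the previous, crashed container instance.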
Also, you may want to use:
Logging Using Elasticsearch and Kibana
Logging Using Stackdriver
Both should do the trick in your case.
Please let me know if that is what you had in mind and whether my answer was helpful.
If you want to see the application logs, then from the development side you just need to print them to the STDOUT and STDERR streams.
The container runtime (I guess Docker in your case) will redirect those streams to /var/log/containers.
(So if you SSH into the node, you can run docker logs <container-id> and see all the relevant logs.)
Kubernetes provides an easier way to access them with kubectl logs <pod-name>, as @Wytrzymały_Wiktor described.
(Note that the logs are rotated automatically every 10 MB, so the kubectl logs command will show entries from the latest rotation only.)
If you want to send the logs to a central logging system (ELK, Splunk, Graylog, etc.), you will have to forward them by running log forwarders inside your cluster.
This can be done, for example, by running a DaemonSet that manages a pod on each node; the pod mounts the logs path (/var/log/containers) via a hostPath volume and forwards the logs to the remote endpoint. See the example here; a minimal sketch of the manifest structure follows.
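A minimal sketch of that structure (the name, namespace, and image are placeholders; a real deployment would use an actual forwarder such as fluentd or fluent-bit with its own configuration):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-forwarder          # placeholder name
  namespace: logging
spec:
  selector:
    matchLabels:
      app: log-forwarder
  template:
    metadata:
      labels:
        app: log-forwarder
    spec:
      containers:
      - name: forwarder
        image: example/log-forwarder:latest   # placeholder; e.g. a fluentd or fluent-bit image
        volumeMounts:
        - name: varlogcontainers
          mountPath: /var/log/containers
          readOnly: true
      volumes:
      - name: varlogcontainers
        hostPath:
          path: /var/log/containers

Note that the files under /var/log/containers are usually symlinks, so real forwarder configurations typically also mount /var/log/pods (and /var/lib/docker/containers on Docker nodes).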

What's the best way to stream logs to CloudWatch Logs for a Spark Structured Streaming application?

The easiest solution I can think of is to attach an appender for CloudWatch Logs to Log4J (e.g., https://github.com/kdgregory/log4j-aws-appenders). The problem is that this will not capture YARN logs, so if YARN fails to start the application altogether, nothing about the failure would reach CloudWatch.
Another option is to forward all spark-submit output (stdout and stderr) to a file and use the CloudWatch Logs agent (installed on the master) to stream everything. That would be plain text, though, so I would need to process the logs to extract the date, level, etc.
I'm running my application on AWS EMR. S3 logs are not an option as these are essentially archived logs and not real time.
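For the first option, a minimal log4j.properties sketch could look like the following; the appender class and the logGroup/logStream property names are my reading of that project's documentation (verify against its README), and the group/stream values are placeholders:

log4j.rootLogger=INFO, cloudwatch
log4j.appender.cloudwatch=com.kdgregory.log4j.aws.CloudWatchAppender
log4j.appender.cloudwatch.logGroup=spark-structured-streaming
log4j.appender.cloudwatch.logStream=driver
log4j.appender.cloudwatch.layout=org.apache.log4j.PatternLayout
log4j.appender.cloudwatch.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

As noted above, this only covers what Log4J itself sees; a YARN-level failure before the application starts would still need to be captured separately (for example by the CloudWatch Logs agent tailing the spark-submit output).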

Pyspark Streaming - How to set up custom logging?

I have a pyspark streaming application that runs on yarn in a Hadoop cluster. The streaming application reads from a Kafka queue every n seconds and makes a REST call.
I have a logging service in place to provide an easy way to collect and store data, send data to Logstash and visualize data in Kibana. The data needs to conform to a template (JSON with specific keys) provided by this service.
I want to send logs from the streaming application to Logstash using this service. For this, I need to do three things:
- Collect some data while the streaming app is reading from Kafka and making the REST call.
- Format it according to the logging service template.
- Forward the log to the Logstash host.
Any guidance related to this would be very helpful.
Thanks!
