AKS log format changed - Azure

We recently updated our AKS cluster from 1.17.x to 1.19.x and noticed that the format of our custom application logs in /var/lib/docker/containers changed.
Before the update it looked like this:
Afterwards it looks like this:
I can find notes in the changelog that Kubernetes changed from plain text logs to structured logs (for system components), but I don't see how this relates to the change in our log format.
https://kubernetes.io/blog/2020/09/04/kubernetes-1-19-introducing-structured-logs/#:~:text=In%20Kubernetes%201.19%2C%20we%20are,migrated%20to%20the%20structured%20format
https://kubernetes.io/docs/concepts/cluster-administration/system-logs/
Is there a way to still get valid JSON logs in /var/lib/docker/containers on AKS > 1.19.x?
Background:
We send our application logs to Splunk and don't use the Azure stack for log analysis. Our Splunk setup cannot parse that new log format as of now.

The format of the logs is defined by the container runtime. It seems that before the upgrade you were parsing logs written by the Docker runtime, and now it is containerd (https://azure.microsoft.com/en-us/updates/azure-kubernetes-service-aks-support-for-containerd-runtime-is-in-preview/).
Based on that article, you can still choose Moby (which is Docker) as the container runtime.
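For reference, the two formats differ roughly like this (illustrative lines, not your actual logs). The Docker json-file driver writes one JSON object per line:
{"log":"some log message\n","stream":"stdout","time":"2021-01-01T12:00:00.000000000Z"}
whereas containerd writes CRI-formatted plain text (timestamp, stream, partial/full flag, message):
2021-01-01T12:00:00.000000000Z stdout F some log message
Note also that with containerd the kubelet writes container logs under /var/log/pods (with symlinks in /var/log/containers) rather than /var/lib/docker/containers, so a forwarder watching the old path may need its monitored path updated as well.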
To take that off your shoulders entirely, you could look into one of the following tools, since they detect the container runtime and log format automatically:
Splunk Connect for Kubernetes https://github.com/splunk/splunk-connect-for-kubernetes
Collectord https://www.outcoldsolutions.com

Related

Structured Logging with Apache Spark on Google Kubernetes Engine

I am running Apache Spark applications on a Google Kubernetes Engine cluster, which propagates any output from STDOUT and STDERR to Cloud Logging. However, granular log severity levels are not propagated: all messages have either INFO or ERROR severity in Cloud Logging (depending on whether they were written to stdout or stderr), and the actual severity level is hidden in a text property.
My goal is to format the messages in the Structured Logging JSON format so that the severity level is propagated to Cloud Logging. Unfortunately, Apache Spark still uses the deprecated log4j 1.x library for logging and I would like to know how to format log messages in a way that Cloud Logging can pick them up correctly.
So far, I am using the following default log4j.properties file:
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
When Cloud Logging is enabled in a GKE cluster, logging is managed by GKE, so it's not possible to change the log format as easily as it is on a GCE instance.
To push JSON format logs in GKE, you can try the following options:
Make your software emit logs in JSON format, so Cloud Logging will detect the JSON-formatted log entries and ingest them as structured payloads (see the example after this list).
Manage your own fluentd version as suggested here and set up your own parser, but then the logging pipeline is managed by you rather than by GKE.
Add a sidecar container that reads your logs, converts them to JSON, and dumps the JSON to stdout. The logging agent in GKE will then ingest the sidecar's logs as JSON.
Bear in mind that option three can lead to significant resource consumption and that you won't be able to use kubectl logs on the original container's output, as explained here.
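As a rough sketch of option one: if the application writes one JSON object per line to stdout and uses the special fields the Cloud Logging agent recognizes (in particular severity and message), the entry should arrive with the right severity, for example:
{"severity":"WARNING","message":"Stage 3 is running slowly"}
With log4j 1.x this means replacing the PatternLayout with a JSON layout (or a small custom Layout) that emits these fields; log4j 1.x does not ship such a layout out of the box, so you would need a third-party or hand-written one.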

process in GCP VM instance killed automatically

I'm using a GCP VM instance to run my Python script as a background process.
But I found that my script received a SIGTERM.
I checked syslog and daemon.log in /var/log and found that my Python script (2316) was terminated by the system.
Which VM settings do I need to check?
Judging from this log line in your screenshot:
Nov 12 18:23:10 ai-task-1 systemd-logind[1051]: Power key pressed.
I would say that your script's process was SIGTERMed as a result of the hypervisor gracefully shutting down the VM, which would happen when a GCP user or service account with admin access to the project performs a GCE compute.instances.stop request.
You can look for this request's logs, for more details on where it came from, in the Logs Viewer/Explorer or with gcloud logging read --freshness=30d (man), using filters like:
resource.type="gce_instance"
"ai-task-1"
timestamp>="2020-11-12T18:22:40Z"
timestamp<="2020-11-12T18:23:40Z"
Though depending on the retention period for your _Default bucket (30 days by default), these logs may have already expired.
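Put together (timestamps and quoting shown only as an example), the query could look like:
gcloud logging read 'resource.type="gce_instance" AND "ai-task-1" AND timestamp>="2020-11-12T18:22:40Z" AND timestamp<="2020-11-12T18:23:40Z"' --freshness=30d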

Pyspark: How can I collect all worker logs in a specified single file in a yarn cluster?

In my code I want to put some logger.info('Doing something') calls. Using the standard library logging module doesn't work.
You can use log4j for logging in your application; make sure the log4j dependency is provided at runtime together with a configured log4j.xml.
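If you specifically want your logger.info('Doing something') calls in PySpark to go through log4j (and therefore end up in the YARN-aggregated logs), a common pattern is to grab the JVM-side log4j logger through the SparkContext gateway. A minimal sketch (the logger name is arbitrary):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my-app").getOrCreate()
sc = spark.sparkContext

# Access log4j on the JVM side through the Py4J gateway
log4j = sc._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger("my-app")

logger.info("Doing something")
logger.warn("Something worth watching")
Note that this only works in driver-side code; functions running on executors (e.g. inside UDFs or map operations) cannot reach the driver's Py4J gateway.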
In order to aggregate logs, you need to check the following things:
Check whether yarn.log-aggregation-enable is set to true in yarn-site.xml and make sure the necessary mount points are configured in yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix.
For example : yarn.nodemanager.remote-app-log-dir=/mnt/app-logs,/mnt1/app-logs/
and yarn.nodemanager.remote-app-log-dir-suffix=/logs/,/logs/
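In yarn-site.xml those settings would look roughly like this (values are just the example paths above):
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/mnt/app-logs,/mnt1/app-logs/</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
  <value>/logs/,/logs/</value>
</property>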
With the above settings, logs are aggregated in HDFS under /mnt/app-logs/{username}/logs/. While a MapReduce/Spark application is running, you can access its logs from YARN's web UI. Once the application has completed, the logs are served through the Job History Server.
If yarn.log-aggregation-enable is disabled, you can check the logs in the location given by yarn.nodemanager.log-dirs on the local node filesystem.
For example: yarn.nodemanager.log-dirs=/mnt/hadoop/logs/,/mnt1/hadoop/logs/
yarn.log-aggregation.retain-seconds -- this value only matters for long-running jobs that take more than 7 days (the default is 7 days); aggregated logs are available for that period, after which the cleanup job deletes them.
After checking the above properties, you can do a few things:
You can use the YARN ResourceManager UI to check the logs of the currently running job. If it has finished, you can check the logs via the History Server.
(OR)
You can SSH to the master node and run yarn logs -applicationId <appid>, but only after the application has finished.
Note: make sure your job history server is up and running and configured with enough resources
You can retrieve all logs (driver and executor) of your application into one file this way:
yarn logs -applicationId <application id> > /tmp/mylog.log
The application id is the id of your application; you can retrieve it from the Spark History Server if needed.
For example: application_1597227165470_1073

Options for getting logs in kubernetes pods

We have some developer logs in Kubernetes pods; what is the best method to make these logs available for the developers to see?
Any specific tools that we can use?
I have the option of Graylog, but I'm not sure whether it can be customized to ingest the developer logs.
The most basic method would be to simply use kubectl logs command:
Print the logs for a container in a pod or specified resource. If the
pod has only one container, the container name is optional.
Here you can find more details regarding the command and its flags, alongside some useful examples.
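A few typical invocations (pod, container, and namespace names are placeholders):
kubectl logs <pod-name>                          # single-container pod
kubectl logs <pod-name> -c <container-name>      # pick a container in a multi-container pod
kubectl logs <pod-name> -n <namespace> -f        # follow the log stream
kubectl logs <pod-name> --previous               # logs of the previous (crashed) container
kubectl logs <pod-name> --tail=100 --since=1h    # last 100 lines from the past hour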
Also, you may want to use:
Logging Using Elasticsearch and Kibana
Logging Using Stackdriver
Both should do the trick in your case.
Please let me know if that is what you had in mind and if my answer was helpful.
If you want to see the application logs, then from the development side you just need to print them to the STDOUT and STDERR streams.
The container runtime (I guess Docker in your case) will redirect those streams
to /var/log/containers.
(So if you would ssh into the Node you can run docker logs <container-id> and you will see all the relevant logs).
Kubernetes provides an easier way to access them with kubectl logs <pod-name>, as @Wytrzymały_Wiktor described.
(Note that the logs are rotated automatically every 10 MB, so the kubectl logs command only shows the entries since the last rotation.)
If you want to send the logs to a central logging system (ELK, Splunk, Graylog, etc.), you will have to forward them from your cluster by running log forwarders inside it.
This can be done, for example, by running a DaemonSet that manages a pod on each node; each pod mounts the logs path (/var/log/containers) via a hostPath volume and forwards the logs to the remote endpoint. See example here, and the sketch below.
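A stripped-down sketch of such a DaemonSet (the image, namespace, and forwarder configuration are placeholders; ready-made charts such as Splunk Connect for Kubernetes or the official fluentd/fluent-bit daemonsets provide a complete version of this):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-forwarder
  namespace: logging
spec:
  selector:
    matchLabels:
      app: log-forwarder
  template:
    metadata:
      labels:
        app: log-forwarder
    spec:
      containers:
      - name: forwarder
        image: fluent/fluentd:v1.14-1   # placeholder; use your forwarder image and its config
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers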

What's the best way to stream logs to CloudWatch Logs for a Spark Structured Streaming application?

The easiest solution I can think of is to attach an appender for CloudWatch Logs to Log4J (e.g., https://github.com/kdgregory/log4j-aws-appenders). The problem is this will not capture YARN logs, so if YARN failed to start the application altogether, nothing would reach CloudWatch about this failure.
Another option is to forward all spark-submit output (stdout and stderr) to a file and use the CloudWatch Logs agent (installed on the master) to stream everything. This would be plain text though, so I'd need to process the logs and extract the date, level, etc.
I'm running my application on AWS EMR. S3 logs are not an option as these are essentially archived logs and not real time.
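For what it's worth, the second option above could be sketched roughly like this (log group, stream, and file paths are illustrative; the datetime_format matches Spark's default yy/MM/dd HH:mm:ss log4j pattern): redirect the spark-submit output to a file,
spark-submit <your usual arguments> > /var/log/spark-submit.log 2>&1
and point the CloudWatch Logs agent at that file in its awslogs.conf:
[/var/log/spark-submit.log]
file = /var/log/spark-submit.log
log_group_name = /emr/spark-structured-streaming
log_stream_name = {instance_id}
datetime_format = %y/%m/%d %H:%M:%S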
