Process in a GCP VM instance killed automatically - Linux

I'm using a GCP VM instance to run my Python script as a background process,
but I found that my script received a SIGTERM.
I checked syslog and daemon.log in /var/log
and found that my Python script (PID 2316) was terminated by the system.
What VM settings do I need to check?

Judging from this log line in your screenshot:
Nov 12 18:23:10 ai-task-1 systemd-logind[1051]: Power key pressed.
I would say that your script's process received SIGTERM because the hypervisor gracefully shut down the VM, which happens when a GCP user or service account with admin access to the project issues a GCE compute.instances.stop request.
You can look for that request's logs for more details on where it came from, either in the Logs Viewer/Explorer or with gcloud logging read --freshness=30d, using filters like:
resource.type="gce_instance"
"ai-task-1"
timestamp>="2020-11-12T18:22:40Z"
timestamp<="2020-11-12T18:23:40Z"
Though depending on the retention period for your _Default bucket (30 days by default), these logs may have already expired.
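For instance, a complete command combining those filters might look like this (YOUR_PROJECT_ID is a placeholder; adjust the timestamps to your incident window):

gcloud logging read \
  'resource.type="gce_instance" AND "ai-task-1" AND timestamp>="2020-11-12T18:22:40Z" AND timestamp<="2020-11-12T18:23:40Z"' \
  --freshness=30d --project=YOUR_PROJECT_ID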

Related

My "spring boot application' cant be accessed but cpu and heap usage is normal

Background:
I have a Kubernetes cluster on my cloud server, and I deployed a Spring Boot application as a pod. Everything was fine until two weeks ago, when my application unexpectedly became unreachable on Feb 22.
I ran kubectl exec -it <pod> sh and then curl 127.0.0.1:<port> inside the pod, but got no response. I checked my application logs but couldn't find any error related to this issue. I tried restarting my application, but the same issue occurred again after two days.
I have no idea what is causing this. Can anyone help me?
When everything is OK, I can call 127.0.0.1:18890 and get a response immediately; once the issue happens, requests time out. Only this Kubernetes service has the problem; the others seem normal.
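For reference, the in-pod check described above written out explicitly (a sketch; the -- separator and a request timeout make a hang easier to distinguish from a refused connection):

kubectl exec -it <pod> -- sh
# then, inside the pod:
curl -v --max-time 10 127.0.0.1:18890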

Pyspark: How can I collect all worker logs in a specified single file in a yarn cluster?

In my code I want to add some logger.info('Doing something') calls. Using the standard library logging module doesn't work.
You can use log4j for logging information in your application; make sure the log4j dependency is provided at runtime along with a configured log4j.xml.
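On the driver, for example, you can reach log4j through PySpark's JVM gateway, roughly like this (a sketch that relies on the internal _jvm attribute and assumes a SparkSession; Python code running on executors has no such gateway, which is why the log aggregation below matters):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("logging-sketch").getOrCreate()
# Obtain a log4j logger from the driver JVM (internal API; may change between Spark versions)
log4j = spark.sparkContext._jvm.org.apache.log4j
logger = log4j.LogManager.getLogger("MyApp")
logger.info("Doing something")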
In order to aggregate logs, you need to check the following things:
Check whether yarn.log-aggregation-enable is set to true in yarn-site.xml, and make sure the necessary mount points are added in yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix.
For example: yarn.nodemanager.remote-app-log-dir=/mnt/app-logs,/mnt1/app-logs/
and yarn.nodemanager.remote-app-log-dir-suffix=/logs/,/logs/
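Expressed as yarn-site.xml entries (inside the <configuration> element), those example settings would look roughly like this (a sketch; adjust the paths to your own mounts):

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir</name>
  <value>/mnt/app-logs,/mnt1/app-logs/</value>
</property>
<property>
  <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
  <value>/logs/,/logs/</value>
</property>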
With the above settings, logs are aggregated in HDFS under /mnt/app-logs/{username}/logs/.
While a MapReduce/Spark application is running, you can access its logs from YARN's web UI. Once the application has completed, the logs are served through the Job History Server.
If yarn.log-aggregation-enable is disabled, you can instead check the logs under the yarn.nodemanager.log-dirs location on the local node filesystem.
For example: yarn.nodemanager.log-dirs=/mnt/hadoop/logs/,/mnt1/hadoop/logs/
yarn.log-aggregation.retain-seconds -- this value only matters for long-running jobs that take more than 7 days (its default is 7 days); aggregated logs are kept for that period, after which a cleanup job deletes them from the nodes.
After checking the above properties, you can do a few things:
You can use the YARN ResourceManager UI and check the logs of the currently running job. If it has finished, you can check the logs via the history server.
(OR)
You can ssh to the master node and run yarn logs -applicationId <appid>, but only after the application has finished.
Note: make sure your Job History Server is up and running and configured with enough resources.
You can retrieve all logs (driver and executor) of your application into one file this way:
yarn logs -applicationId <application id> > /tmp/mylog.log
The application id is the id of your application; you can retrieve it from the Spark history server if needed.
An example id: application_1597227165470_1073
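Putting the example id into the command above (assuming /tmp/mylog.log is writable on the node where you run it):

yarn logs -applicationId application_1597227165470_1073 > /tmp/mylog.log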

Bacula - Director unable to authenticate with Storage daemon

I'm trying to stay sane while configuring a Bacula server on my virtual CentOS Linux release 7.3.1611 machine to run a basic local backup job.
I prepared all the configuration I found necessary in the conf files and prepared the MySQL database accordingly.
When I want to start a job (a local backup for now), I enter the following commands in bconsole:
*Connecting to Director 127.0.0.1:9101
1000 OK: bacula-dir Version: 5.2.13 (19 February 2013)
Enter a period to cancel a command.
*label
Automatically selected Catalog: MyCatalog
Using Catalog "MyCatalog"
Automatically selected Storage: File
Enter new Volume name: MyVolume
Defined Pools:
1: Default
2: File
3: Scratch
Select the Pool (1-3): 2
This returns:
Connecting to Storage daemon File at 127.0.0.1:9101 ...
Failed to connect to Storage daemon.
Do not forget to mount the drive!!!
You have messages.
where the message is:
12-Sep 12:05 bacula-dir JobId 0: Fatal error: authenticate.c:120 Director unable to authenticate with Storage daemon at "127.0.0.1:9101". Possible causes:
Passwords or names not the same or
Maximum Concurrent Jobs exceeded on the SD or
SD networking messed up (restart daemon).
Please see http://www.bacula.org/en/rel-manual/Bacula_Freque_Asked_Questi.html#SECTION00260000000000000000 for help.
I double- and triple-checked all the conf files for integrity, names and passwords. I don't know where else to look for the error.
I will gladly post any parts of the conf files, but I don't want to bloat this question right away if it might not be necessary. Thank you for any hints.
It might help someone who makes the same mistake I did:
After looking through manual page after manual page, I found it was my own mistake. I had (for a reason I don't precisely recall, I guess to troubleshoot another issue earlier) set all ports to 9101: for the Director, the File Daemon and the Storage Daemon.
So I assume the Bacula components blocked each other's communication on port 9101. After restoring the default ports (9102 and 9103) according to the manual, it worked and I can now back up locally.
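For reference, a minimal sketch of the default port assignments across the three configuration files (directive names as in the Bacula manual; all other required directives are omitted here):

# bacula-dir.conf
Director {
  Name = bacula-dir
  DIRport = 9101
}
Storage {
  Name = File
  Address = 127.0.0.1
  SDPort = 9103      # the Director contacts the Storage daemon here, not on 9101
}

# bacula-sd.conf
Storage {
  Name = bacula-sd
  SDPort = 9103
}

# bacula-fd.conf
FileDaemon {
  Name = bacula-fd
  FDport = 9102
}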
You have to add the Director's name from the backup server: edit /etc/bacula/bacula-fd.conf on the remote client and see "List Directors who are permitted to contact this File daemon":
Director {
Name = BackupServerName-dir
Password = "use *-dir password from the same file"
}

Shell multiple logs monitoring and correlation

I have been trying this for days but am still struggling.
The objective of the script is to perform real-time log monitoring on multiple servers (29 of them) and to correlate login-failure records between servers. Each server's log is compressed at 23:59:59 every day, and a new log starts at midnight.
My idea was to run tail -f | grep "failed password" | tee centralized_log on every server, launched by a loop over all the server names and run in the background, writing the login-failure records to a centralized log. But it doesn't work, and it creates a lot of daemons which become zombies as soon as I terminate the script.
I am also considering running tail at an interval of a few minutes, but as the logs grow larger the processing time will increase. How can I set a pointer to where the previous tail stopped?
So could you please suggest a better, working way to do multiple-log monitoring and correlation? Additional installations are not encouraged unless totally necessary.
If your logs are going through syslog, and you're using rsyslogd, then you can configure the syslog on each machine to forward the specific messages you're interested in to one (or two) centralized log servers, using a property match like:
:msg, contains, "failed password"
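Combined with a forwarding action, the rule on each client might look like this in rsyslog's legacy syntax (a sketch; loghost and TCP port 514 are placeholders for your central server):

:msg, contains, "failed password" @@loghost:514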
See the rsyslog documentation for more details about how to set up reliable syslog forwarding.

Error message "barrier-based sync failed".

I'm using an Amazon Linux EC2 AMI, and recently I've been getting the error "barrier-based sync failed" every 10 days.
Running Services
Apache
MySQL
Memcached Server
PHP
This isn't an error message.
It is an informational log message indicating that the hardware (or the storage driver) does not support the cache-flush command. These messages occur normally.
To stop the message from being logged, either remount the filesystem with the nobarrier option,
e.g. mount -o remount,nobarrier /dev/path /path/to/mountpoint, or pass the barrier=off option to the kernel on boot.
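To make the change persist across reboots, the option can also be added in /etc/fstab; a sketch assuming an ext4 root filesystem on /dev/xvda1 (substitute your actual device, mount point and filesystem type):

/dev/xvda1  /  ext4  defaults,nobarrier  1  1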
