I'm trying to manually OOM Kill pods for testing purposes, does anyone know how I can achieve this?
You can run stress-ng inside the pod. The same tool can also stress CPU and I/O at the same time if you need to.
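For example, a minimal sketch, assuming stress-ng is installed in the container image and the pod name is a placeholder, that allocates more memory than the container's limit and should trigger an OOM kill:
kubectl exec -it <pod-name> -- stress-ng --vm 1 --vm-bytes 2G --vm-keep
Adjust --vm-bytes so it exceeds whatever memory limit is set on the container.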
We are using AWS ECS with launch type "Fargate" to manage our containers. We have memory leak issues which we are actively investigating; in the meantime, we need a solution to take down tasks that pass a certain memory threshold.
Using the AWS CLI to run "update-service force-new-deployment" takes all the tasks down. We could target individual tasks using "aws ecs stop-task"; however, I cannot find a metric in CloudWatch that gives us this task-specific information. We can only seem to find cluster- or service-level averages.
Any help would be appreciated.
I could not comment, so I am posting this as an answer.
You could work around it by setting a hard memory limit for the container with the memory parameter.
If your container attempts to exceed the memory specified here, the container is killed.
https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ContainerDefinition.html
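For illustration, a minimal sketch of a container definition with a 512 MiB hard limit; the name and image are placeholders:
{
  "name": "my-app",
  "image": "my-app:latest",
  "memory": 512,
  "essential": true
}
With this in place, ECS kills the container as soon as it tries to exceed 512 MiB, which effectively takes down leaking tasks without a separate monitoring pipeline.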
I am currently testing how Azure Kubernetes handles failover for StatefulSets. I simulated a network partition by running sudo iptables -A INPUT -j DROP on one of my nodes; not perfect, but good enough to test some things.
1). How can I reuse disks that are mounted to a failed node? Is there a way to manually release the disk and make it available to the rescheduled pod? It takes forever for the resources to be released after doing a force delete; sometimes this takes over an hour.
2). If I delete a node from the cluster, all the resources are released after a certain amount of time. The problem is that the Azure dashboard still displays my cluster as using 3 nodes even though I have deleted one. Is there a way to manually add the deleted node back in, or do I need to rebuild the cluster each time?
3). I most definitely do not want to use ReadWriteMany.
Basically, what I want is for my StatefulSet pods to terminate, have the associated disks detach, and then be rescheduled on a new node in the event of a network partition or a node failure. I know the pods will terminate when the cluster recovers from a network partition, but I want control over the process myself, or at least have it happen sooner.
Yes, just detach the disks manually from the portal (or PowerShell/CLI/API/etc.).
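For example, a minimal sketch with the Azure CLI; the resource group, VM, and disk names are placeholders:
az vm disk detach --resource-group myResourceGroup --vm-name myNodeVm --name myDataDisk
Once the disk is detached, the rescheduled pod's PersistentVolume can attach it to the new node.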
This is not supported; you should not do this. Scaling or upgrading might fix it, but it might not.
Okay, don't.
I am operating a Kubernetes cluster.
There are many terminating pods, and many crond daemons are running on the VM.
Both /var/log/messages and /var/log/crond are empty.
I don't know why so many crond daemons have appeared.
More than 500 crond daemons are executing:
ps -ef | grep crond | wc -l
648
The load average is 16.
I want to know the relation between crond and pod termination on Kubernetes.
How could I determine this?
I checked /etc/rsyslog.conf - it's normal.
By default, cron emails the program output to the user who owns a particular crontab, so you can check whether any of the emails have been delivered under the default path /var/spool/mail.
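For example, to see whether any cron mail has piled up; the root mailbox is just one possibility, and the exact path can differ per distribution:
ls -l /var/spool/mail
tail /var/spool/mail/root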
When you have a long-running or continuous script that can never finish in cron, it can produce multiple cron processes in the process list, so it might be useful to get a tree view of the crontab-specific parent/child processes:
pstree -ap | grep crond
I assume that you have high CPU utilization on your VM, which can degrade overall performance and affect the Kubernetes engine. Kubernetes provides a comprehensive mechanism for managing compute resources: it distributes the resources allocated on a specific node among the pods consuming CPU and RAM on that node.
To check general resource utilization on a particular Node, you can use this command:
kubectl describe node <node-name>
To check a pod's termination reason, you can use a similar command:
kubectl describe pod <pod_name>
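If you only need the last termination reason, a jsonpath query is a handy shortcut; the pod name is a placeholder, and this assumes a single container in the pod:
kubectl get pod <pod_name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
An OOM-killed container reports OOMKilled here.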
However, when you need to dig deeper into troubleshooting your Kubernetes cluster, I would recommend looking at the official guide.
I am running a Kubernetes cluster on Azure deployed using "azure acs ...". Recently I noticed that the pods on one of the nodes were not responsive and that the CPU on the node was maxed out. I logged in, executed top and found that a process called "mdsd" was using up all available CPU.
When I killed that process with "sudo kill -9", the CPU usage returned to normal and my pods were working fine.
It seems to me that "mdsd" is part of the Azure Linux monitoring framework. I installed omi-1.4.0-6.ssl_100.ulinux.x64.deb.
Is there a way to make sure that mdsd is not eating up all my CPU and stopping my pods from working properly?
I need to monitor the RAM and CPU of Spark applications that run on a standalone Spark cluster.
I have tried using the Java console and it works very well, but I need to monitor various applications, and I would have to set a different Java console port for each one.
Behind a firewall this becomes a very long and tedious job.
Is there a way to monitor applications from the Spark UI, for example, or in some other way?
If you use Ubuntu, you may use htop.
To install, do
sudo apt-get install htop
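Then run htop and filter for the Spark processes. As a sketch, assuming pgrep is available, you can pass the Spark JVM PIDs directly:
htop -p $(pgrep -d, -f spark)
Note that htop only shows per-process usage on the local machine, so on a cluster you would run it on each worker.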