How to increase the PyTorch timeout? - pytorch

I'm training an ML model with YOLOv5; this is my command:
python3 -m torch.distributed.run --nproc_per_node 2 train.py --batch 100 --epochs 1000 --data /home/username/Documents/folder_name/numbers.yaml --weights yolov5s.pt --device 0,1 --hyp data/hyps/hyp.scratch-high.yaml --name folder_name --patience 0
It cuts out after 30 minutes because of the default PyTorch distributed timeout of 1800 s. How can I increase it?
https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
Thanks
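One way to do this (a minimal sketch, assuming you are willing to patch the spot in YOLOv5's train.py where the process group is initialised) is to pass a larger timeout to torch.distributed.init_process_group, the parameter documented at the link above:

import datetime
import torch.distributed as dist

# Sketch only: the timeout keyword of init_process_group controls how long
# distributed collectives may block before being aborted; the 1800 s default
# is what causes the 30-minute cutoff. Raising it to, say, 3 hours:
dist.init_process_group(
    backend="nccl",                       # GPU training on devices 0,1
    timeout=datetime.timedelta(hours=3),  # instead of the default 30 minutes
)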

Related

Docker desktop on linux is using a lot of ram

I'm running Docker desktop on Ubuntu 22.04. Every time I start it, it eats a lot of RAM.
PID USER %MEM COMMAND
135264 user 26.0 qemu-system-x86_64 -accel kvm -cpu host -machine q35 -m 3849 -smp 8 -kernel /opt/docker-desktop/linuxkit/kernel -append page_poison=1 vsyscall=emulate panic=1 nospec_store_bypass_disable noibrs noibpb no_stf_barrier mitigations=off linuxkit.unified_cgroup_hierarchy=1 vpnkit.connect=tcp+bootstrap+client://gateway.docker.internal:35817/95d4e7d4090b2d25b84ed2f2bd2e54523bafd0dfc2e2388838f04b9d045e0fe2 vpnkit.disable=osxfs-data console=ttyS0 -initrd /opt/docker-desktop/linuxkit/initrd.img -serial pipe:/tmp/qemu-console1696356651/fifo -drive if=none,file=/home/lev/.docker/desktop/vms/0/data/Docker.raw,format=raw,id=hd0 -device virtio-blk-pci,drive=hd0,serial=dummyserial -netdev user,id=net0,ipv6=off,net=192.168.65.0/24,dhcpstart=192.168.65.9 -device virtio-net-pci,netdev=net0 -vga none -nographic -monitor none -object memory-backend-memfd,id=mem,size=3849M,share=on -numa node,memdev=mem -chardev socket,id=char0,path=virtiofs.sock0 -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=virtiofs0
10422 user 2.3 /snap/firefox/1883/usr/lib/firefox/firefox
...
While docker ps shows that there are no containers running.
I've noticed that 3849M of memory is mentioned in the command, but I can't be entirely sure it's related; besides, it eats far more than 4 GB.
Well, Docker Desktop uses all of its allocated memory at start; please see
https://github.com/docker/for-mac/issues/4229
You can set the memory limit under:
Docker Dashboard >> Settings >> Resources >> Apply and Restart
Otherwise, if you want to check how resources are split between the running containers,
run docker stats to see the memory usage of the currently running containers.
See https://docs.docker.com/engine/reference/commandline/stats/
For example:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
db6115785a9e 001_jan_twit 0.00% 35.71MiB / 7.774GiB 0.45% 38.6MB / 659kB 16.4kB / 222MB 2
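If you would rather collect the same per-container numbers programmatically, a rough sketch with the Docker SDK for Python (the docker package, which nothing above uses, so treat it purely as an illustration) could look like this:

import docker  # pip install docker

# Print memory usage of every running container, roughly what
# `docker stats --no-stream` reports.
client = docker.from_env()
for container in client.containers.list():
    stats = container.stats(stream=False)          # one-shot stats snapshot
    usage = stats["memory_stats"].get("usage", 0)  # bytes currently in use
    limit = stats["memory_stats"].get("limit", 0)  # cgroup memory limit
    print(f"{container.short_id} {container.name}: "
          f"{usage / 2**20:.1f} MiB / {limit / 2**30:.2f} GiB")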

NRPE not pulling data: NRPE: Unable to read output

I'm trying to get a memory metric from a client machine. I installed NRPE on the client machine and it works well for the default checks like load, users and so on.
Manual output from the client machine:
root@Nginx:~# /usr/lib/nagios/plugins/check_mem -w 50 -c 40
OK - 7199 MB (96%) Free Memory
But when I try from the server, other metrics work but the memory metric does not:
[ec2-user@ip-10-0-2-179 ~]$ /usr/lib64/nagios/plugins/check_nrpe -H 107.XX.XX.XX -c check_mem
NRPE: Unable to read output
Other metrics work well:
[ec2-user@ip-10-0-2-179 ~]$ /usr/lib64/nagios/plugins/check_nrpe -H 107.XX.XX.XX -c check_load
OK - load average: 0.00, 0.01, 0.05|load1=0.000;15.000;30.000;0; load5=0.010;10.000;25.000;0; load15=0.050;5.000;20.000;0;
I ensured that the check_mem command has execute permission for all:
root@Nginx:~# ll /usr/lib/nagios/plugins/check_mem
-rwxr-xr-x 1 root root 2394 Sep 6 00:00 /usr/lib/nagios/plugins/check_mem*
Here are the command definitions from my client-side NRPE config:
command[check_users]=/usr/lib/nagios/plugins/check_users -w 5 -c 10
command[check_load]=/usr/lib/nagios/plugins/check_load -w 15,10,5 -c 30,25,20
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /dev/sda1
command[check_zombie_procs]=/usr/lib/nagios/plugins/check_procs -w 5 -c 10 -s Z
command[check_procs]=/usr/lib/nagios/plugins/check_procs -w 200 -c 250
command[check_http]=/usr/lib/nagios/plugins/check_http -I 127.0.0.1
command[check_swap]=/usr/lib/nagios/plugins/check_swap -w 30 -c 20
command[check_mem]=/usr/lib/nagios/plugins/check_mem -w 30 -c 20
Can anyone help me to fix the issue?
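For reference, NRPE reports "Unable to read output" whenever the command it runs prints nothing to stdout; a wrong plugin path, a missing interpreter, or permissions when running as the nrpe/nagios user are the usual suspects. The sketch below is a hypothetical minimal memory plugin in Python, shown only to illustrate the contract NRPE expects (one status line on stdout plus the matching exit code); the thresholds mirror the -w 30 -c 20 in the config above:

#!/usr/bin/env python3
# Hypothetical minimal Nagios-style memory check (not the check_mem script
# used in the question). NRPE needs the plugin to always print a status line
# and exit 0 (OK), 1 (WARNING) or 2 (CRITICAL).
import sys

WARN, CRIT = 30, 20  # free-memory percentages, as in "-w 30 -c 20"

meminfo = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, value = line.split(":", 1)
        meminfo[key] = int(value.split()[0])  # values are in kB

total = meminfo["MemTotal"]
free = meminfo.get("MemAvailable", meminfo["MemFree"])
pct_free = 100 * free / total

if pct_free < CRIT:
    print(f"CRITICAL - {free // 1024} MB ({pct_free:.0f}%) Free Memory")
    sys.exit(2)
if pct_free < WARN:
    print(f"WARNING - {free // 1024} MB ({pct_free:.0f}%) Free Memory")
    sys.exit(1)
print(f"OK - {free // 1024} MB ({pct_free:.0f}%) Free Memory")
sys.exit(0)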

Varnish processes not closing and taking huge amounts of memory

We are using Varnish Cache 4.1 on a CentOS server. When we start the Varnish server, lots of varnish processes start and they do not close; because of this we are facing a memory leak issue. Please let us know how we can resolve it.
My configuration (/etc/sysconfig/varnish) is:
#DAEMON_OPTS="-a :80 \
# -T localhost:6082 \
# -f /etc/varnish/default.vcl \
# -S /etc/varnish/secret \
# -p thread_pools=8 \
# -p thread_pool_max=4000 \
# -p thread_pool_add_delay=1 \
# -p send_timeout=30 \
# -p listen_depth=4096 \
# -s malloc,2G"
backend default {
.host = "127.0.0.1";
.port = "8080";
.probe = {
.url = "/";
.interval = 5s;
.timeout = 1s;
.window = 5;
.threshold = 3;
}
}
34514 89208 83360 S 0.0 4.3 0:00.00 /usr/sbin/varnishd -a :80 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 120 -p thread_pool_min=50 -p ...
1678 varnish 20 0 345M 89208 83360 S 0.0 4.3 0:00.03 /usr/sbin/varnishd -a :80 -f /etc/varnish/default.vcl -T 127.0.0.1:6082 -t 120 -p thread_pool_min=50 -p ...
1679 varnish 20 0 ...
You are not limiting the space for transient objects. By default an unlimited malloc is used (see the official docs: https://www.varnish-cache.org/docs/4.0/users-guide/storage-backends.html#transient-storage ).
From what I see in your message, you are not using the DAEMON_OPTS parameter (it is commented out).
What are the contents of your varnishd.service file and /etc/varnish/varnish.params?
EDIT
Nothing's wrong with your init.d script. It should use the settings found in /etc/sysconfig/varnish.
How much RAM is consumed by Varnish?
All the Varnish threads share the same storage (malloc 2G + Transient malloc 100M), so it should take up to about 2.1G for storage. You need to add an average overhead of about 1 KB per object stored in cache to get the total memory used.
I don't think you are suffering from a memory leak; the processes are normal. You told Varnish to spawn 50 of them (with the thread_pools parameter), so they are expected.
I'd recommend decreasing the number of thread_pools; you are setting it to 50. You should be able to lower it to something between 2 and 8; at the same time it will help to increase thread_pool_max to 5000 and set thread_pool_min to 1000.
We run a very large server with 2 pools × 1000-5000 threads and have no issues.
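To put rough numbers on that estimate, here is a back-of-the-envelope calculation in Python; the object count is made up purely for illustration:

# Rough memory ceiling for the setup above (illustrative numbers only).
STORAGE_GIB = 2.0          # "-s malloc,2G"
TRANSIENT_GIB = 0.1        # Transient malloc assumed capped at 100M, as above
OVERHEAD_KIB_PER_OBJ = 1   # ~1 KB of bookkeeping per cached object

objects_in_cache = 500_000  # hypothetical cache population

overhead_gib = objects_in_cache * OVERHEAD_KIB_PER_OBJ / (1024 * 1024)
total_gib = STORAGE_GIB + TRANSIENT_GIB + overhead_gib
print(f"expected memory ceiling ~ {total_gib:.2f} GiB")  # ~2.58 GiB here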

Does iperf have a bandwidth ceiling?

I am trying to run iperf and reach a throughput of 1 Gbit/s. I'm using UDP, so I expect the overhead to be pretty much minimal. Still, I see it capped at about 600 Mbit/s despite my attempts.
I have been running:
iperf -c 172.31.1.1 -u -b 500M -l 1100
iperf -c 172.31.1.1 -u -b 1000M -l 1100
iperf -c 172.31.1.1 -u -b 1500M -l 1100
Yet anything above 600M seems to hit a limit of about 600 Mbit/s. For example, the output for 1000M is:
[ 3] Server Report:
[ 3] 0.0-10.0 sec 716 MBytes 601 Mbits/sec 0.002 ms 6544/689154 (0.95%)
[ 3] 0.0-10.0 sec 1 datagrams received out-of-order
I'm running this on a server with a 10 Gig port and am even sending it right back to itself, so there should be no interface bottlenecks.
I'm unsure whether I am running up against an iperf limit or whether there is another way to get a true 1 Gbit/s test.
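For what it's worth, a single UDP iperf stream is often limited by one CPU core generating packets rather than by the interface, so a common workaround is to split the load across parallel streams with iperf's -P option and read the aggregate line in the server report. A small, hypothetical Python wrapper (assuming iperf 2, where -b is, as far as I recall, applied per stream):

import subprocess

# Push ~1 Gbit/s as four parallel 250 Mbit/s UDP streams instead of one
# generator thread; compare the aggregate against the single-stream runs above.
cmd = [
    "iperf", "-c", "172.31.1.1",
    "-u",          # UDP, as in the question
    "-b", "250M",  # per-stream target rate
    "-P", "4",     # four parallel client streams -> ~1 Gbit/s total
    "-l", "1100",  # same datagram size as above
]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)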

Auto scaling not happening on Amazon EC2

I am trying to set up auto scaling on Amazon EC2 with the commands below:
elb-create-lb nalb1 --headers --listener "lb-port=80,instance-port=80,protocol=http" --availability-zones us-east-1c
elb-register-instances-with-lb nalb1 --headers --instances i-1ecef57c
elb-configure-healthcheck nalb1 --headers --target "HTTP:80/" --interval 30 --timeout 3 --unhealthy-threshold 2 --healthy-threshold 10
as-create-launch-config nalc1 --image-id ami-cdd306a4 --instance-type t1.micro
as-create-auto-scaling-group naasg1 --launch-configuration nalc1 --availability-zones us-east-1c --min-size 0 --max-size 10 --load-balancers nalb1
as-put-scaling-policy --auto-scaling-group naasg1 --name policy-scaleup --adjustment 100 --type PercentChangeInCapacity
as-put-scaling-policy --auto-scaling-group naasg1 --name policy-scaledown --adjustment=-1 --type ChangeInCapacity
as-create-or-update-trigger nat1 \
--auto-scaling-group naasg1 --namespace "AWS/EC2" \
--measure CPUUtilization --statistic Average \
--dimensions "AutoScalingGroupName=naasg1" \
--period 60 --lower-threshold 30 --upper-threshold 60 \
--lower-breach-increment=-1 --upper-breach-increment=1 \
--breach-duration 120
The following commands show the status of the various resources once the commands above have been run.
root@domU-12-31-39-09-B8-12 ~# elb-describe-lbs
LOAD_BALANCER nalb1 nalb1-1717211844.us-east-1.elb.amazonaws.com 2012-01-24T09:45:11.440Z
root@domU-12-31-39-09-B8-12 ~# as-describe-launch-configs
LAUNCH-CONFIG nalc1 ami-cdd306a4 t1.micro
root@domU-12-31-39-09-B8-12 ~# as-describe-auto-scaling-groups
AUTO-SCALING-GROUP naasg1 nalc1 us-east-1c nalb1 0 10 0
root@domU-12-31-39-09-B8-12 ~# as-describe-policies
No policies found
root@domU-12-31-39-09-B8-12 ~# as-describe-triggers --auto-scaling-group naasg1
DEPRECATED: This command is deprecated and included only to facilitate migration to the new trigger mechanism. You should use this command for migration purposes only.
TRIGGER nat1 naasg1 NoData AWS/EC2 CPUUtilization Average 60
root@domU-12-31-39-09-B8-12 ~#
Despite all this, auto scaling is not happening.
What might be the reason?
Thanks for help
The commands below worked :)
elb-create-lb nalb1 --headers --listener "lb-port=80,instance-port=80,protocol=http" --availability-zones us-east-1c
elb-register-instances-with-lb nalb1 --headers --instances i-1ecef57c
elb-configure-healthcheck nalb1 --headers --target "HTTP:80/" --interval 30 --timeout 3 --unhealthy-threshold 2 --healthy-threshold 10
as-create-launch-config nalc1 --image-id ami-cdd306a4 --instance-type t1.micro
as-create-auto-scaling-group naasg1 --launch-configuration nalc1 --availability-zones us-east-1c --min-size 2 --max-size 10 --load-balancers nalb1
as-put-scaling-policy --auto-scaling-group naasg1 --name policy-scaleup --adjustment=2 --type ChangeInCapacity
as-put-scaling-policy --auto-scaling-group naasg1 --name policy-scaledown --adjustment=-1 --type ChangeInCapacity
as-set-desired-capacity naasg1 -c 2
Of course, you need to create alarms in CloudWatch and associate these policies with two alarms, one handling scale-up and one handling scale-down.
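The as-*/elb-* command line tools used above have long since been superseded; purely as an illustration of that last step (wiring a CloudWatch alarm to the scale-up policy), here is roughly how it could look with boto3. The group and policy names match the ones above; the CPU thresholds are example values:

import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Look up the ARN of the scale-up policy created above.
policy = autoscaling.describe_policies(
    AutoScalingGroupName="naasg1",
    PolicyNames=["policy-scaleup"],
)["ScalingPolicies"][0]

# Fire the scale-up policy when the group's average CPU stays above 60%
# for two 60-second periods. A mirror-image alarm would drive policy-scaledown.
cloudwatch.put_metric_alarm(
    AlarmName="naasg1-cpu-high",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Statistic="Average",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "naasg1"}],
    Period=60,
    EvaluationPeriods=2,
    Threshold=60.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)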
