Why is PyTorch running out of memory on a trivial multiplication?

I'm trying to get the PyTorch MNIST tutorial to run under WSL2/Ubuntu with an RTX 3060 Ti GPU. On the first training batch it slurps up all of the Linux RAM until Ubuntu kills it.
After paring down the tutorial, I see the same failure with tiny tensors in this simple repro case:
import torch
x0 = torch.tensor([[1.], [4.]], device='cuda')
w0 = torch.tensor([[2.]], device='cuda')
y0 = torch.nn.functional.linear(x0, w0)  # crashes here; should return tensor([[2.], [8.]])
The Jupyter kernel runs out of memory and dies.
What I've tried:
Checking that the GPU can be seen from the shell and that torch.cuda.is_available() == True.
Creating the tensors on the CPU rather than on the cuda device - this works.
Running the code through the python command line rather than Jupyter - still fails.
Trying various NVIDIA Windows drivers for CUDA versions 11.4 to 12.0 - doesn't seem to matter.
Wiping and rebuilding the WSL Ubuntu instance - doesn't help.
$ conda list | grep torch
pytorch 1.13.1 py3.10_cuda11.7_cudnn8.5.0_0
pytorch-cuda 11.7 h67b0de4_1
$ nvidia-smi
Wed Feb 15 15:27:25 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.75 Driver Version: 517.40 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 On | N/A |
| 0% 39C P8 12W / 200W | 515MiB / 8192MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
$ ls -al /usr/lib/wsl/lib
total 74192
drwxr-xr-x 1 root root 40 Feb 15 15:23 .
drwxr-xr-x 4 root root 4096 Feb 15 06:13 ..
-r-xr-xr-x 1 root root 141464 Sep 12 16:54 libcuda.so
-r-xr-xr-x 1 root root 141464 Sep 12 16:54 libcuda.so.1
-r-xr-xr-x 1 root root 141464 Sep 12 16:54 libcuda.so.1.1
-r-xr-xr-x 1 root root 800568 Oct 7 18:46 libd3d12.so
-r-xr-xr-x 1 root root 6224608 Oct 7 18:46 libd3d12core.so
-r-xr-xr-x 1 root root 829248 Oct 7 18:46 libdxcore.so
-r-xr-xr-x 1 root root 5950624 Sep 12 16:54 libnvcuvid.so
-r-xr-xr-x 1 root root 5950624 Sep 12 16:54 libnvcuvid.so.1
-r-xr-xr-x 1 root root 7547400 Sep 12 16:54 libnvdxdlkernels.so
-r-xr-xr-x 1 root root 424400 Sep 12 16:54 libnvidia-encode.so
-r-xr-xr-x 1 root root 424400 Sep 12 16:54 libnvidia-encode.so.1
-r-xr-xr-x 1 root root 212624 Sep 12 16:54 libnvidia-ml.so.1
-r-xr-xr-x 1 root root 354768 Sep 12 16:54 libnvidia-opticalflow.so
-r-xr-xr-x 1 root root 354768 Sep 12 16:54 libnvidia-opticalflow.so.1
-r-xr-xr-x 1 root root 45845584 Sep 12 16:54 libnvwgf2umx.so
-r-xr-xr-x 1 root root 600472 Sep 12 16:54 nvidia-smi

I was able to get it working by making sure WSL is configured with more memory than the GPU has. It seems NVIDIA's Unified Virtual Addressing (UVA) wants to map the RTX 3060 Ti's whole 8GB into Linux's memory space on the first call? When I increased my WSL memory from 2GB to 16GB (via %USERPROFILE%\.wslconfig), my example and the PyTorch tutorial started working.
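As a rough way to check the condition described above (WSL memory larger than GPU memory), here is a minimal sketch. It assumes the Linux memory total is read from /proc/meminfo and that the GPU size is the MiB figure shown by nvidia-smi. The function names are hypothetical, and the "more RAM than VRAM" rule is the observation from this answer, not a documented requirement.

```python
import re

def parse_mem_total_kib(meminfo_text):
    """Return the MemTotal value (in kB) from /proc/meminfo-style text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    return int(match.group(1))

def has_headroom(mem_total_kib, gpu_mib):
    """True if system RAM exceeds GPU memory (heuristic from the answer above)."""
    return mem_total_kib > gpu_mib * 1024

if __name__ == "__main__":
    with open("/proc/meminfo") as fh:
        total_kib = parse_mem_total_kib(fh.read())
    # 8192 MiB is the GPU size reported by nvidia-smi in the question.
    print("enough RAM for the GPU mapping:", has_headroom(total_kib, 8192))
```

With the 2GB WSL default from the question this reports False; after raising the limit to 16GB it reports True.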

Related

Patch command not working - can't find file to patch

I'm unable to apply the following patch from GitHub inside my Docker container. I get the error "can't find file to patch".
https://patch-diff.githubusercontent.com/raw/ManageIQ/manageiq-providers-ansible_tower/pull/267.patch
[root@0fec7605d8b9 manageiq-providers-ansible_tower-d5ec9817e49c]# patch < 267.patch
can't find file to patch at input line 16
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--------------------------
|From 36f36d6a9985d8df27ae35cbe13bf47f16309a69 Mon Sep 17 00:00:00 2001
|From: Adam Grare <adam#grare.com>
|Date: Thu, 14 Oct 2021 10:55:39 -0400
|Subject: [PATCH 1/3] Fix merge_extra_vars with nil variables
|
|---
| .../ansible_tower/automation_manager/configuration_script.rb | 2 +-
| .../automation_manager/configuration_workflow.rb | 2 +-
| .../providers/ansible_tower/automation_manager/job.rb | 4 ++--
| 3 files changed, 4 insertions(+), 4 deletions(-)
|
|diff --git a/app/models/manageiq/providers/ansible_tower/automation_manager/configuration_script.rb b/app/models/manageiq/providers/ansible_tower/automation_manager/configuration_script.rb
|index d133a97..83bad07 100644
|--- a/app/models/manageiq/providers/ansible_tower/automation_manager/configuration_script.rb
|+++ b/app/models/manageiq/providers/ansible_tower/automation_manager/configuration_script.rb
--------------------------
File to patch: q
q: No such file or directory
Skip this patch? [y] ^C
The files are inside the following directory:
/opt/manageiq/manageiq-gemset/bundler/gems/manageiq-providers-ansible_tower-d5ec9817e49c
I'm trying to apply the patch from within this directory. All the files are inside the app directory.
[root@0fec7605d8b9 manageiq-providers-ansible_tower-d5ec9817e49c]# ls -l
total 92
-rw-r--r-- 1 root root 20923 Dec 18 09:28 267.patch
drwxrwxr-x 1 root root 4096 Nov 29 01:44 app
drwxrwxr-x 3 root root 4096 Nov 29 01:44 bin
drwxrwxr-x 2 root root 4096 Nov 29 01:44 bundler.d
-rw-r--r-- 1 root root 5734 Jul 13 18:25 CHANGELOG.md
drwxrwxr-x 2 root root 4096 Nov 29 01:44 config
-rw-r--r-- 1 root root 603 Jul 13 18:25 Gemfile
drwxrwxr-x 5 root root 4096 Nov 29 01:44 lib
-rw-r--r-- 1 root root 11358 Jul 13 18:25 LICENSE.txt
drwxrwxr-x 2 root root 4096 Nov 29 01:44 locale
-rw-r--r-- 1 root root 6630 Jul 13 18:25 manageiq-providers-ansible_tower.gemspec
-rw-r--r-- 1 root root 339 Jul 13 18:25 Rakefile
-rw-r--r-- 1 root root 1735 Jul 13 18:25 README.md
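The error output above hints at the -p (--strip) option. As a minimal sketch of what that option does, the code below drops leading components from a path taken from the diff header and reports the first strip level at which the file actually exists. The helper names are hypothetical and the stripping is simplified (it does not collapse repeated slashes the way patch does).

```python
import os

def strip_components(path, count):
    """Drop `count` leading path components, roughly like patch -p<count>."""
    return path.split("/", count)[-1]

def find_strip_level(header_path, root="."):
    """Return the smallest strip count for which the file exists under root."""
    for count in range(header_path.count("/") + 1):
        candidate = os.path.join(root, strip_components(header_path, count))
        if os.path.isfile(candidate):
            return count
    return None

if __name__ == "__main__":
    # Path copied from the "--- a/..." line in the error output above.
    header = ("a/app/models/manageiq/providers/ansible_tower/"
              "automation_manager/configuration_script.rb")
    print("suggested -p level:", find_strip_level(header))
```

If this prints 1 when run from the gem directory, then patch -p1 < 267.patch would be the invocation to try, since the leading a/ component is what does not exist on disk.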

How to install different versions of Python on CentOS 8?

I have python3.6 installed on CentOS Linux release 8.3:
[fnord@fnord fnord]$ ls -ls /usr/bin/python*
0 lrwxrwxrwx 1 root root 9 Aug 31 2020 /usr/bin/python2 -> python2.7
12 -rwxr-xr-x 1 root root 8224 Aug 31 2020 /usr/bin/python2.7
0 lrwxrwxrwx. 1 root root 25 Jun 24 2020 /usr/bin/python3 -> /etc/alternatives/python3
0 lrwxrwxrwx 1 root root 31 Nov 4 2020 /usr/bin/python3.6 -> /usr/libexec/platform-python3.6
0 lrwxrwxrwx 1 root root 17 Nov 4 2020 /usr/bin/python3.6-config -> python3.6m-config
0 lrwxrwxrwx 1 root root 32 Nov 4 2020 /usr/bin/python3.6m -> /usr/libexec/platform-python3.6m
0 lrwxrwxrwx 1 root root 39 Nov 4 2020 /usr/bin/python3.6m-config -> /usr/libexec/platform-python3.6m-config
0 lrwxrwxrwx 1 root root 46 Nov 4 2020 /usr/bin/python3.6m-x86_64-config -> /usr/libexec/platform-python3.6m-x86_64-config
0 lrwxrwxrwx 1 root root 32 Mar 16 2021 /usr/bin/python3-config -> /etc/alternatives/python3-config
[fnord@fnord fnord]$ python3 --version
Python 3.6.8
[fnord@fnord fnord]$ cat /etc/centos-release
CentOS Linux release 8.3.2011
How would I install python3.7, python3.8, ... on the same system?
Use dnf, e.g.:
dnf install python38
Then you can use venv to manage which version of Python you're using for a particular project.
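To make the venv step concrete, here is a minimal sketch that creates a virtual environment with a specific interpreter. It assumes the dnf package provides a python3.8 executable on PATH. The function names and the target directory are hypothetical.

```python
import shutil
import subprocess

def venv_command(version, target):
    """Build the command that creates a venv with the given Python version."""
    return [f"python{version}", "-m", "venv", target]

def create_venv(version, target):
    """Create the venv, failing early if that interpreter is not installed."""
    cmd = venv_command(version, target)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Equivalent to running: python3.8 -m venv ./env38
    create_venv("3.8", "./env38")
```

Each project then activates its own environment, so the system python3 (3.6.8 here) is left untouched.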

Why does rsync --sparse produce bigger qcow2 files than the original qcow2 files?

Problem:
When I copy the whole disk with my virtual machines onto an empty disk with rsync --sparse, the disk images (qcow2 files) on the new disk are bigger than the original files.
Old Disk:
/dev/sda1 => /ssdstor
New Disk:
/dev/sdb1 => /new
Details:
Hardware:
2x SSD Crucial M500 960GB, firmware MU5
OS: Proxmox 3.4
Filesystem: XFS
Command:
rsync -axHv --force --progress --stats --sparse /ssdstor/ /new/
Rsync Version:
dpkg -l | grep rsync
ii rsync 993.1.1-1 amd64 fast, versatile, remote (and local) file-copying tool
File / disk comparison after first copy
(to check everything was transferred correctly)
rsync -axHv --dry-run --force --progress --stats --sparse /ssdstor/ /new/
sending incremental file list
Number of files: 90,545 (reg: 70,269, dir: 9,395, link: 10,817, dev: 4, special: 60)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 0
Total file size: 634,456,255,674 bytes
Total transferred file size: 0 bytes
Literal data: 0 bytes
Matched data: 0 bytes
File list size: 65,536
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 2,097,654
Total bytes received: 9,993
sent 2,097,654 bytes received 9,993 bytes 1,405,098.00 bytes/sec
total size is 634,456,255,674 speedup is 301,025.86 (DRY RUN)
mount | egrep '(sda|sdb)'
/dev/sda1 on /ssdstor type xfs (rw,noatime,nodiratime,attr2,inode64,noquota)
/dev/sdb1 on /new type xfs (rw,noatime,nodiratime,attr2,inode64,noquota)
df -h | egrep '(sda|sdb)'
/dev/sda1 894G 388G 506G 44% /ssdstor
/dev/sdb1 894G 430G 465G 49% /new
ls -alshR /ssdstor | grep qcow2
77G -rw-r--r-- 1 root root 103G Jul 14 09:09 vm-100-disk-1.qcow2
6,2G -rw-r--r-- 1 root root 14G Jul 14 09:07 vm-101-disk-1.qcow2
2,0G -rw-r--r-- 1 root root 4,1G Jul 14 09:07 vm-101-disk-2.qcow2
17G -rw-r--r-- 1 root root 61G Feb 18 09:10 vm-102-disk-1.qcow2
40G -rw-r--r-- 1 root root 78G Jul 14 09:06 vm-103-disk-1.qcow2
40G -rw-r--r-- 1 root root 41G Jul 14 09:05 vm-103-disk-2.qcow2
31G -rw-r--r-- 1 root root 44G Jul 14 09:05 vm-104-disk-1.qcow2
5,2G -rw-r--r-- 1 root root 41G Mai 1 01:00 vm-105-disk-2.qcow2
63G -rw-r--r-- 1 root root 65G Jul 14 10:04 vm-106-disk-1.qcow2
26G -rw-r--r-- 1 root root 65G Jul 14 09:14 vm-107-disk-2.qcow2
51G -rw-r--r-- 1 root root 51G Mai 19 21:21 vm-108-disk-1.qcow2
ls -alshR /new | grep qcow2
79G -rw-r--r-- 1 root root 103G Jul 14 09:09 vm-100-disk-1.qcow2
6,2G -rw-r--r-- 1 root root 14G Jul 14 09:07 vm-101-disk-1.qcow2
2,0G -rw-r--r-- 1 root root 4,1G Jul 14 09:07 vm-101-disk-2.qcow2
17G -rw-r--r-- 1 root root 61G Feb 18 09:10 vm-102-disk-1.qcow2
40G -rw-r--r-- 1 root root 78G Jul 14 09:06 vm-103-disk-1.qcow2
41G -rw-r--r-- 1 root root 41G Jul 14 09:05 vm-103-disk-2.qcow2
37G -rw-r--r-- 1 root root 44G Jul 14 09:05 vm-104-disk-1.qcow2
34G -rw-r--r-- 1 root root 41G Mai 1 01:00 vm-105-disk-2.qcow2
63G -rw-r--r-- 1 root root 65G Jul 14 10:04 vm-106-disk-1.qcow2
33G -rw-r--r-- 1 root root 65G Jul 14 09:14 vm-107-disk-2.qcow2
51G -rw-r--r-- 1 root root 51G Mai 19 21:21 vm-108-disk-1.qcow2
Does anyone have an idea?
More tests:
cp --sparse=always vm-105-disk-2.qcow2 vm-105-disk-2.qcow2.new
5,2G -rw-r--r-- 1 root root 41G Jul 16 08:07 vm-105-disk-2.qcow2
34G -rw-r--r-- 1 root root 41G Jul 16 11:51 vm-105-disk-2.qcow2.new
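In the ls -s listings above, the first column is the space actually allocated on disk and the later size column is the apparent file size. As a minimal sketch, both numbers can be read per file like this (the function name is hypothetical; st_blocks is only available on Unix-like systems):

```python
import os
import sys

def sparse_report(path):
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    apparent = st.st_size
    allocated = st.st_blocks * 512  # st_blocks counts 512-byte units
    return apparent, allocated

if __name__ == "__main__":
    for name in sys.argv[1:]:
        apparent, allocated = sparse_report(name)
        print(f"{name}: apparent={apparent} allocated={allocated}")
```

Running it against the same image on /ssdstor and /new shows how much of the difference comes from holes that were not recreated in the copy.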

Why can't the apache2 and tomcat7 services autostart on Ubuntu Cloudimg?

I use the official Ubuntu Cloudimg as my test environment; I downloaded it from this page:
http://cloud-images.ubuntu.com/vagrant/raring/current/
I use Vagrant + VirtualBox to deploy the box file.
Everything is OK except that the apache2 and tomcat7 services never start automatically. I have tried everything I could think of to solve it, but failed.
Here is what I have tried:
cat /var/log/boot.log
Cloud-init v. 0.7.2 running 'init-local' at Thu, 05 Sep 2013 01:15:26 +0000. Up 3.59 seconds.
cloud-init-nonet[3.70]: waiting 10 seconds for network device
rpcbind: Cannot open '/run/rpcbind/rpcbind.xdr' file for reading, errno 2 (No such file or directory)^M
rpcbind: Cannot open '/run/rpcbind/portmap.xdr' file for reading, errno 2 (No such file or directory)^M
cloud-init-nonet[4.74]: static networking is now up
Cloud-init v. 0.7.2 running 'init' at Thu, 05 Sep 2013 01:15:28 +0000. Up 4.89 seconds.
ci-info: +++++++++++++++++++++++++Net device info+++++++++++++++++++++++++
ci-info: +--------+------+-----------+---------------+-------------------+
ci-info: | Device | Up | Address | Mask | Hw-Address |
ci-info: +--------+------+-----------+---------------+-------------------+
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | . |
ci-info: | eth0 | True | 10.0.2.15 | 255.255.255.0 | 08:00:27:ae:1a:5c |
ci-info: +--------+------+-----------+---------------+-------------------+
ci-info: ++++++++++++++++++++++++++++++Route info++++++++++++++++++++++++++++++
ci-info: +-------+-------------+----------+---------------+-----------+-------+
ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
ci-info: +-------+-------------+----------+---------------+-----------+-------+
ci-info: | 0 | 0.0.0.0 | 10.0.2.2 | 0.0.0.0 | eth0 | UG |
ci-info: | 1 | 10.0.2.0 | 0.0.0.0 | 255.255.255.0 | eth0 | U |
ci-info: +-------+-------------+----------+---------------+-----------+-------+
2013-09-05 09:15:28,570 - cloud-init[WARNING]: Stdout, stderr changing to (| tee -a /var/log/cloud-init-output.log, | tee -a /var/log/cloud-init-output.log)
Strangely, only the first boot is logged. Nothing is written to boot.log after that.
ll /etc/rc2.d/
total 12
drwxr-xr-x 2 root root 4096 Sep 3 10:31 ./
drwxr-xr-x 116 root root 4096 Nov 8 11:26 ../
-rw-r--r-- 1 root root 677 Jan 30 2013 README
lrwxrwxrwx 1 root root 17 Sep 2 17:09 S20postfix -> ../init.d/postfix*
lrwxrwxrwx 1 root root 22 Sep 2 17:09 S20redis-server -> ../init.d/redis-server*
lrwxrwxrwx 1 root root 32 Aug 31 12:06 S20virtualbox-guest-utils -> ../init.d/virtualbox-guest-utils*
lrwxrwxrwx 1 root root 16 Aug 31 12:06 S21puppet -> ../init.d/puppet*
lrwxrwxrwx 1 root root 13 Sep 3 10:31 S23ntp -> ../init.d/ntp*
lrwxrwxrwx 1 root root 26 Aug 31 11:36 S45landscape-client -> ../init.d/landscape-client*
lrwxrwxrwx 1 root root 15 Aug 31 11:36 S50rsync -> ../init.d/rsync*
lrwxrwxrwx 1 root root 19 Aug 31 11:36 S70dns-clean -> ../init.d/dns-clean*
lrwxrwxrwx 1 root root 18 Aug 31 11:36 S70pppd-dns -> ../init.d/pppd-dns*
lrwxrwxrwx 1 root root 14 Aug 31 11:34 S75sudo -> ../init.d/sudo*
lrwxrwxrwx 1 root root 17 Sep 2 17:08 S91apache2 -> ../init.d/apache2*
lrwxrwxrwx 1 root root 17 Sep 2 17:07 S92tomcat7 -> ../init.d/tomcat7*
lrwxrwxrwx 1 root root 21 Aug 31 12:06 S99chef-client -> ../init.d/chef-client*
lrwxrwxrwx 1 root root 21 Aug 31 11:36 S99grub-common -> ../init.d/grub-common*
lrwxrwxrwx 1 root root 18 Aug 31 11:34 S99ondemand -> ../init.d/ondemand*
lrwxrwxrwx 1 root root 18 Aug 31 11:34 S99rc.local -> ../init.d/rc.local*
You can see that the apache2 and tomcat7 links begin with S.
But I have to run service apache2 start ; service tomcat7 start to start the services manually after every boot.
The mysql service is normal, though. It starts automatically.
What's wrong? How can I solve this problem?
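For readers unfamiliar with the naming in the /etc/rc2.d listing above: in the SysV convention, a link name is an S (start) or K (stop) prefix, a two-digit order number, and the init script name. A minimal sketch that decodes such names (the function name is hypothetical):

```python
import re

_RC_LINK = re.compile(r"^([SK])(\d{2})(.+)$")

def parse_rc_link(name):
    """Split an rc?.d link name into (action, order, service), or None."""
    match = _RC_LINK.match(name)
    if match is None:
        return None
    action = "start" if match.group(1) == "S" else "stop"
    return action, int(match.group(2)), match.group(3)

if __name__ == "__main__":
    for link in ["S20postfix", "S91apache2", "S92tomcat7", "README"]:
        print(link, "->", parse_rc_link(link))
```

By this convention S91apache2 and S92tomcat7 are set up to start in runlevel 2, which is why the listing alone does not explain the failure.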
I've been having a similar problem. I think it is caused by this bug:
sysv init scripts not started on boot with cloud ubuntu raring
Although the bug is still unconfirmed, the workaround described in the bug report (using "console log" in /etc/init/rc.conf) worked for me.

How to replace a column's data in Linux output, maybe using awk, sed, etc.

Basically I'm trying to compare the output of two files (one from a PC, which converts a .cpio file to the layout format using standard tools, and another from an embedded device using busybox tools). Both of them produce the file system file/directory layout in 'ls -l' format. The problem is that the output from the embedded device prints directories with some size info, which is not present in the .cpio output.
Hence I have decided to replace the size column with 0 bytes on directory lines.
Output from device:
drwxrwxr-x 2 root root 5512 Aug 22 2013 bin
lrwxrwxrwx 1 root root 7 Aug 22 2013 bin/addgroup -> busybox
lrwxrwxrwx 1 root root 7 Aug 22 2013 bin/adduser -> busybox
lrwxrwxrwx 1 root root 7 Aug 22 2013 bin/ash -> busybox
Output from PC:
drwxrwxr-x 2 root root 0 Aug 22 09:32 bin
lrwxrwxrwx 1 root root 7 Aug 22 09:24 bin/addgroup -> busybox
lrwxrwxrwx 1 root root 7 Aug 22 09:24 bin/adduser -> busybox
lrwxrwxrwx 1 root root 7 Aug 22 09:24 bin/ash -> busybox
Comparing the outputs, I have two issues to fix:
1) The directory size is not shown correctly; I want to use awk/sed to replace it with '0' on the device side.
2) Similarly, the '09:32' time needs to be replaced with '2013'. If I know how to do the first, I can do the second myself.
Please share your ideas to fix this.
$ cat foo.input
drwxrwxr-x 2 root root 5512 Aug 22 2013 bin
lrwxrwxrwx 1 root root 7 Aug 22 2013 bin/addgroup -> busybox
lrwxrwxrwx 1 root root 7 Aug 22 2013 bin/adduser -> busybox
lrwxrwxrwx 1 root root 7 Aug 22 2013 bin/ash -> busybox
$ cat foo2.input
drwxrwxr-x 2 root root 0 Aug 22 09:32 bin
lrwxrwxrwx 1 root root 7 Aug 22 09:24 bin/addgroup -> busybox
lrwxrwxrwx 1 root root 7 Aug 22 09:24 bin/adduser -> busybox
lrwxrwxrwx 1 root root 12345 Aug 22 09:24 bin/ash -> busybox
$ diff <(awk '/^d/{$5=0}{$6=$7=$8=""}1' foo.input) <(awk '/^d/{$5=0}{$6=$7=$8=""}1' foo2.input)
4c4
< lrwxrwxrwx 1 root root 7 bin/ash -> busybox
---
> lrwxrwxrwx 1 root root 12345 bin/ash -> busybox
/^d/{$5=0} sets field (column) 5 to 0 if the line matches ^d (= starts with d = directory).
{$6=$7=$8=""} empties fields 6, 7 and 8 on all lines, since you want to ignore the date in your output.
The trailing 1 prints each (modified) line, and diff then compares the two outputs.
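The same normalization can be sketched in Python. This is not the answer's method, just an illustration of the idea under one stated difference: it drops the three date fields instead of blanking them, and it collapses whitespace. The function name is hypothetical.

```python
def normalize(line):
    """Zero the size on directory lines and drop the date fields (6-8)."""
    fields = line.split()
    if line.startswith("d"):
        fields[4] = "0"  # field 5 in awk's 1-based numbering
    return " ".join(fields[:5] + fields[8:])

if __name__ == "__main__":
    device = "drwxrwxr-x 2 root root 5512 Aug 22 2013 bin"
    pc = "drwxrwxr-x 2 root root 0 Aug 22 09:32 bin"
    print(normalize(device) == normalize(pc))
```

Applying normalize to every line of both files before diffing leaves only real differences, such as the changed size of bin/ash in foo2.input.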
Preserving spacing:
$ awk '
BEGIN{ preRE="^([^[:space:]]+[[:space:]]+){4}" }
/^d/{
    match($0,preRE)
    preLgth=RLENGTH
    match($0,preRE "[^[:space:]]+")
    strLgth=RLENGTH-preLgth
    $0 = substr($0,1,preLgth) sprintf("%*s",strLgth,0) substr($0,preLgth+strLgth+1)
}
1
' file
drwxrwxr-x 2 root root 0 Aug 22 2013 bin
lrwxrwxrwx 1 root root 7 Aug 22 2013 bin/addgroup -> busybox
lrwxrwxrwx 1 root root 7 Aug 22 2013 bin/adduser -> busybox
lrwxrwxrwx 1 root root 7 Aug 22 2013 bin/ash -> busybox
