I tried using CUDA with PyTorch on my setup, but it cannot be detected and I am puzzled as to why.
torch.cuda.is_available()
returns False. Digging deeper,
torch._C._cuda_getDeviceCount()
returns 0. I am using version 1.5:
$ pip freeze | grep torch
torch==1.5.0
I wrote a small C program to do the same:
#include <stdio.h>
#include <cuda_runtime_api.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Device count: %d\n", count);
    return 0;
}
This prints 1, so the CUDA runtime can obviously find a device. Also, running nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 106... Off | 00000000:02:00.0 On | N/A |
| 0% 41C P8 9W / 200W | 219MiB / 6075MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
So where did my CUDA device disappear to in Python?
I have now realized that there is a different build of PyTorch for every minor version of CUDA, so in my case torch==1.5.0 apparently defaults to CUDA 10.2, while the special package torch==1.5.0+cu101 works.
I hope this clears things up for other people who, like me, start by reading the docs on PyPI (more up-to-date docs, if you know where to look, are here: https://pytorch.org/get-started/locally/).
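A quick sanity check for anyone hitting the same mismatch (assuming the +cu101 wheel is now installed) is to compare the CUDA version the wheel was built against with what nvidia-smi reports:

import torch

print(torch.__version__)          # e.g. 1.5.0+cu101
print(torch.version.cuda)         # CUDA version this wheel was built against
print(torch.cuda.is_available())  # should now return True

In my case the driver (435.21) only supports up to CUDA 10.1, which is presumably why the default wheel built for CUDA 10.2 reported no devices.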
I use Python 3 to do some encrypted calculations with Microsoft SEAL and am looking for performance improvements.
I do it by:
creating a shared memory region to hold the plaintext data (using a numpy array in shared memory for multiprocessing); a minimal sketch of this setup is shown right after this list
starting multiple processes with multiprocessing.Process (there is a parameter controlling the number of processes, thus limiting the CPU usage)
having the processes read from shared memory and do some encrypted calculation
waiting for the calculations to end and joining the processes
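Here is a minimal sketch of that setup (no SEAL involved; the array contents, the worker count of 8, and the sum() stand-in for the encrypted work are placeholders for illustration only):

import numpy as np
from multiprocessing import Process, Queue, shared_memory

def worker(shm_name, shape, dtype, result_queue):
    # Attach to the existing shared memory block and view it as a numpy array.
    shm = shared_memory.SharedMemory(name=shm_name)
    data = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
    result_queue.put(float(data.sum()))  # stand-in for the encrypted calculation
    shm.close()

if __name__ == "__main__":
    src = np.arange(1_000_000, dtype=np.float64)
    shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
    buf = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
    buf[:] = src  # the plaintext data lives in shared memory, not copied per process
    results = Queue()
    procs = [Process(target=worker, args=(shm.name, src.shape, src.dtype, results))
             for _ in range(8)]  # the parameter controlling the number of processes
    for p in procs:
        p.start()
    partial_results = [results.get() for _ in procs]
    for p in procs:
        p.join()
    shm.close()
    shm.unlink()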
I run this program on a 32U64G (32 vCPUs, 64GB RAM) x86 Linux server; the CPU model is Intel(R) Xeon(R) Gold 6161 CPU @ 2.20GHz.
I noticed that if I double the number of processes, there is only about a 20% improvement in run time.
I've tried three different process counts:
| process count | 7   | 13 | 27  |
| time ratio    | 0.8 | 1  | 1.2 |
Why is this improvement disproportionate to the resources I use (CPU & memory)?
Conceptual explanations or specific Linux command lines are both welcome.
Thanks.
FYI:
The code of my subprocesses looks like this:
def sub_process_main(encrypted_bytes, plaintext_array, result_queue):
    # init
    # status_sign
    while shared_int > 0:
        # SEAL load and some other calculation
        encrypted_matrix_list = seal.Ciphertext.load(encrypted_bytes)
        shared_plaintext_matrix = seal.Encoder.encode(plaintext_array)
        # ... do something
        for _ in some_loop:
            time1 = time.time()
            res = []
            for i in range(len(encrypted_matrix_list)):
                enc = seal.evaluator.multiply_plain(encrypted_matrix_list[i],
                                                    shared_plaintext_matrix[i])
                res.append(enc)
            time2 = time.time()
            print(f'time usage: {time2 - time1}')
            # ... do something
    result_queue.put(final_result)
I actually printed the time for every part of my code, and here is the time cost for the part shown above.
| process count  | 13        | 27        |
| occurrences    | 1791      | 864       |
| total time (s) | 1698.2140 | 1162.8330 |
| average (s)    | 0.9482    | 1.3459    |
I've monitored some metrics but I don't know if there are any abnormal ones.
For the 13-process run: top, pidstat and vmstat output (screenshots omitted here).
For the 27-process run: top, pidstat and vmstat output (screenshots omitted here). In that top output, why is the load spread across all cores rather than exactly 27 of them? Does it have anything to do with Hyper-Threading?
I am using TensorFlow 2.3 on a dedicated machine with 2 GPUs. I am using the Styleformer model to turn informal sentences into formal ones. I want to use both GPUs for this task.
Here is the information about the GPUs:
!nvidia-smi
| NVIDIA-SMI XXX.XX.XX Driver Version: ******** CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:89:00.0 Off | 0 |
| N/A 34C P0 42W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:8A:00.0 Off | 0 |
| N/A 35C P0 43W / 300W | 0MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

which returns:
['/device:GPU:0', '/device:GPU:1']
Here is the code that I am running on the GPU:
from styleformer import Styleformer
import torch
import warnings

sf = Styleformer(style=0)
source_sentences = [
    "I am quitting my job",
    "Jimmy is on crack and can't trust him",
    "What do guys do to show that they like a gal?",
]

for source_sentence in source_sentences:
    target_sentence = sf.transfer(source_sentence, inference_on=1, quality_filter=0.95, max_candidates=5)
In the above code, inference_on=1 means we are using the GPU. But how can I ensure it is using both GPUs? I went to the transfer function inside the styleformer package and found this:
def transfer(self, input_sentence, inference_on=0, quality_filter=0.95, max_candidates=5):
    if self.model_loaded:
        if inference_on == 0:
            device = "cpu"
        elif inference_on == 1:
            device = "cuda:0"
        else:
            device = "cpu"
            print("Onnx + Quantisation is not supported in the pre-release...stay tuned.")
How can I change the above code to use both GPUs?
If your model is not using the GPU, there could be multiple reasons:
you did not install the CUDA toolkit & drivers needed for the GPU to be accessible in developer mode
an NVIDIA driver issue (uninstall & reinstall)
a TensorFlow version issue
Check all of these and try again. The notebook will pick up the GPU automatically if it is available and everything is installed.
When it is running on the GPU, you will see 0MiB / 32510MiB change to more than 0MiB.
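As a quick sanity check of what each framework can actually see (Styleformer itself runs on PyTorch, so it is worth querying both; a small sketch, not specific to Styleformer):

import tensorflow as tf
import torch

print(tf.config.list_physical_devices('GPU'))  # TensorFlow's view of the GPUs
print(torch.cuda.device_count())               # PyTorch's view
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))       # name of the first visible device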
I wrote an OpenCL kernel to increment 64-bit floating-point values in an array, but the results differ between the CPU and the GPU.
import numpy as np
import pyopencl as cl
CL_INC = '''
__kernel void inc_f64(__global const double *a_g, __global double *res_g)
{
    int gid = get_global_id(0);
    res_g[gid] = a_g[gid] + 1.0;
}
'''

def test(dev_type):
    ctx = cl.Context(dev_type=dev_type)
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags
    prg = cl.Program(ctx, CL_INC).build()

    in_py = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    out_py = np.empty_like(in_py)
    in_cl = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=in_py)
    out_cl = cl.Buffer(ctx, mf.WRITE_ONLY, in_py.nbytes)

    prg.inc_f64(queue, in_py.shape, None, in_cl, out_cl)
    cl.enqueue_copy(queue, out_py, out_cl)
    queue.finish()
    return out_py

print('Run inc_f64() on CPU: ', end='')
print(test(cl.device_type.CPU))
print('Run inc_f64() on GPU: ', end='')
print(test(cl.device_type.GPU))
Output:
Run inc_f64() on CPU: [2. 3. 4. 5. 6.]
Run inc_f64() on GPU: [2.40000038e+001 3.20000076e+001 5.26354425e-315 0.00000000e+000
0.00000000e+000]
Hardware information:
[0] Apple / OpenCL 1.2 (Oct 31 2017 18:30:00)
|- [0:0] CPU / OpenCL 1.2 / Intel(R) Core(TM) i7-3667U CPU @ 2.00GHz
|- [0:1] GPU / OpenCL 1.2 / HD Graphics 4000
Is it a hardware limitation or just a bug in the source code?
Your GPU probably doesn't support double-precision floating-point numbers. Have you checked whether it supports the cl_khr_fp64 extension?
Your kernel must also declare its requirement:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
For more details, see the cl_khr_fp64 extension documentation.
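From pyopencl you can check this per device; a small sketch (the device names will differ on other machines):

import pyopencl as cl

for platform in cl.get_platforms():
    for dev in platform.get_devices():
        supported = 'cl_khr_fp64' in dev.extensions
        print(f'{dev.name}: cl_khr_fp64 supported = {supported}')

If the HD Graphics 4000 does not list cl_khr_fp64, that would be consistent with the garbage values in the GPU run.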
I looked at how much RAM is used by Rust programs (the RES column of the top command) and I wonder why they use so much memory.
Here is an example:
use std::io;

fn main() {
    println!("What's your name?");
    let mut input = String::new();
    io::stdin().read_line(&mut input).unwrap();
    println!("Hello {}!", input);
}
I saw that 6 MB of memory was used before I input something.
Here is how I compiled and executed the program:
cargo build --release
./target/release/main
The equivalent C program:
#include <stdio.h>

int main(void) {
    printf("What's your name?\n");
    char input[100] = {0};
    scanf("%s", input);
    printf("Hello %s!\n", input);
    return 0;
}
only uses 0.6 MB. In this case, the Rust program uses 10 times more memory. In other cases, I saw that the Rust program uses 5 times more memory.
I also tested with other languages to compare.
The OCaml version:
let () =
  print_endline "What's your name?";
  let line = read_line () in
  print_string "Hello ";
  print_endline line
uses 1 MB.
The Haskell version:
main = do
  putStrLn "What's your name?"
  name <- getLine
  putStrLn ("Hello " ++ name ++ "!")
uses 3 MB.
The Python version:
print("What's your name?")
name = input()
print("Hello", name, "!")
uses 7 MB, almost the same as the Rust version!
Update
I'm running Linux (ArchLinux) with Rust 1.3 (I also tried the nightly with similar results).
Update 2
Here is more data from the htop command:
VIRT RES SHR MEM% Command
15572 2936 804 0.1 ocaml
21728 2732 2528 0.1 haskell
22540 7480 4308 0.2 python
4056 668 600 0.0 c
24180 6164 1928 0.2 rust
Update 3
I did more tests with massif to see the memory usage.
For every program, I ran massif twice, as follows:
valgrind --tool=massif --time-unit=B ./program
valgrind --tool=massif --pages-as-heap=yes --time-unit=B ./program
Here are the results with all the programs (as shown by ms_print):
C versions:
https://framabin.org/?dd243f8ec99155bc#Af5cPrcHnz3DsWiOStfwgW8Qq6BTVhogz/46L+sMuSs=
https://framabin.org/?261b9366c3749469#1ztDBkgVly9CanrrWWrJdh3yBFL5PEIW3OI5OLnze/Q=
Rust versions:
https://framabin.org/?0f1bac1c750e97bf#AXwlFYYPHeazq9LfsTOpRBaUTTkb1NfN9ExPorDJud0=
https://framabin.org/?c24b21b01af36782#OLFWdwLjVG2t7eoLqLFhe0Pp8Q8pA2S/oq4jdRRWPzI=
OCaml versions:
https://framabin.org/?060f05bea318109c#/OJQ8reHCU3CzzJ5NCOCLOYJQFnA1VgxqAIVjgQWX9I=
https://framabin.org/?8ff1ffb6d03cb37a#GN8bq3Wrm6tNWaINIhMAr4ieltLtOPjuZ4Ynof9bV4w=
Haskell versions:
https://framabin.org/?b204bd978b8c1fd8#DyQH862AM8NEPTKlzEcZgoapPaZLdlF9W3dRn47K5yU=
https://framabin.org/?ac1aa89fcaeb782c#TQ+uAiqerjHuuEEIhehVitjm63nc3wu5wfivAeBH5uI=
Python versions:
https://framabin.org/?197e8b90df5373ec#aOi0+tEj32Na5jW66Kl97q2lsjSZ2x7Cwl/pOt0lYIM=
https://framabin.org/?397efa22484e3992#1ylOrmjKaA9Hg7gw7H7rKGM0MyxuvKwPNN1J/jLEMrk=
Summary (RAM usage):
|------------|----------|----------|----------|----------|----------|
| | C | Haskell | OCaml | Rust | Python |
|------------|----------|----------|----------|----------|----------|
| First run | 1 B | 63.12 KB | 5.993 MB | 816 B | 1.321 MB |
|------------|----------|----------|----------|----------|----------|
| Second run | 6.031 MB | 24.20 MB | 17.14 MB | 25.60 MB | 27.43 MB |
|------------|----------|----------|----------|----------|----------|
The first run is without the --pages-as-heap=yes parameter.
I also ran massif with the --stacks=yes option for C and Rust.
C version:
https://framabin.org/?b3009d198ccfdee1#HxR6LPPAzt15K+wIFdaqlfSJjBrJvhV2ZHWdElg3ezc=
(3.141 KB)
Rust version:
https://framabin.org/?b446d8d76c279007#tHnGiOnRstTA2krhz6cgfvTjI+FclcZS3rqyZvquWdQ=
(8.602 KB)
What explains such a huge difference between heap block allocation and page allocation in Rust?
Because the standard library is statically linked.
You can overcome this by compiling with the -C prefer-dynamic option.
As to the reason behind statically linking the standard library: it increases executable portability (i.e., no need for the standard library to be installed on the target system).
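For example, when invoking rustc directly the flag is passed as rustc -C prefer-dynamic main.rs, and with cargo it can be forwarded through the RUSTFLAGS environment variable, e.g. RUSTFLAGS="-C prefer-dynamic" cargo build --release. Note that the resulting binary then needs the Rust standard library shared object (libstd-*.so) to be available on the target system at run time.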
Since this question is among the top results on Google, I would like to give an update for anybody looking at this in 2022. I ran the exact same program and measured the Rust binary's RSS in htop. It shows 924KB, that is 0.92MB. Apparently Rust has improved a lot over the years.
This article has a very good discussion of the topic. Some of the largest and most common culprits are Cargo defaulting to debug builds (not relevant in your case) and statically linking libraries by default.
Of the 192GB RAM installed in my computer, I have the 188GB above 4GB (at hardware address 0x100000000) reserved by the Linux kernel at boot time (mem=4G memmap=188G$4G). A data acquisition kernel module accumulates data into this large area, used as a ring buffer, via DMA. A user space application mmaps this ring buffer into user space and then copies blocks from the ring buffer at the current location for processing once they are ready.
Copying these 16MB blocks from the mmap'ed area using memcpy does not perform as I expected. It appears that the performance depends on the size of the memory reserved at boot time (and later mmap'ed into user space). http://www.wurmsdobler.org/files/resmem.zip contains the source code for a kernel module which implements the mmap file operation:
module_param(resmem_hwaddr, ulong, S_IRUSR);
module_param(resmem_length, ulong, S_IRUSR);
//...
static int resmem_mmap(struct file *filp, struct vm_area_struct *vma) {
    remap_pfn_range(vma, vma->vm_start,
                    resmem_hwaddr >> PAGE_SHIFT,
                    resmem_length, vma->vm_page_prot);
    return 0;
}
and a test application which, in essence (with the checks removed), does:
#define BLOCKSIZE ((size_t)16*1024*1024)
int resMemFd = ::open(RESMEM_DEV, O_RDWR | O_SYNC);
unsigned long resMemLength = 0;
::ioctl(resMemFd, RESMEM_IOC_LENGTH, &resMemLength);
void* resMemBase = ::mmap(0, resMemLength, PROT_READ | PROT_WRITE, MAP_SHARED, resMemFd, 4096);
char* source = ((char*)resMemBase) + RESMEM_HEADER_SIZE;
char* destination = new char[BLOCKSIZE];
struct timeval start, end;
gettimeofday(&start, NULL);
memcpy(destination, source, BLOCKSIZE);
gettimeofday(&end, NULL);
float time = (end.tv_sec - start.tv_sec)*1000.0f + (end.tv_usec - start.tv_usec)/1000.0f;
std::cout << "memcpy from mmap'ed to malloc'ed: " << time << "ms (" << BLOCKSIZE/1000.0f/time << "MB/s)" << std::endl;
I have carried out memcpy tests of a 16MB data block for the different sizes of reserved RAM (resmem_length) on Ubuntu 10.04.4, Linux 2.6.32, on a SuperMicro 1026GT-TF-FM109:
| reserved size | 1GB | 4GB | 16GB | 64GB | 128GB | 188GB |
| run 1 | 9.274ms (1809.06MB/s) | 11.503ms (1458.51MB/s) | 11.333ms (1480.39MB/s) | 9.326ms (1798.97MB/s) | 213.892ms (78.43MB/s) | 206.476ms (81.25MB/s) |
| run 2 | 4.255ms (3942.94MB/s) | 4.249ms (3948.51MB/s) | 4.257ms (3941.09MB/s) | 4.298ms (3903.49MB/s) | 208.269ms (80.55MB/s) | 200.627ms (83.62MB/s) |
My observations are:
From the first to the second run, memcpy from the mmap'ed area to the malloc'ed buffer seems to benefit from the contents already being cached somewhere.
There is a significant performance degradation for reserved sizes above 64GB, which is noticeable whenever memcpy is used.
I would like to understand why that is. Perhaps somebody in the Linux kernel developer group thought: 64GB should be enough for anybody (does this ring a bell?)
Kind regards,
peter
Based on feedback from SuperMicro, the performance degradation is due to NUMA, non-uniform memory access. The SuperMicro 1026GT-TF-FM109 uses the X8DTG-DF motherboard with one Intel 5520 Tylersburg chipset at its heart, connected to two Intel Xeon E5620 CPUs, each of which has 96GB RAM attached.
If I lock my application to CPU0, I can observe different memcpy speeds depending on what memory area was reserved and consequently mmap'ed. If the reserved memory area is off-CPU, then mmap struggles for some time to do its work, and any subsequent memcpy to and from the "remote" area consumes more time (data block size = 16MB):
resmem=64G$4G (inside CPU0 realm): 3949MB/s
resmem=64G$96G (outside CPU0 realm): 82MB/s
resmem=64G$128G (outside CPU0 realm): 3948MB/s
resmem=92G$4G (inside CPU0 realm): 3966MB/s
resmem=92G$100G (outside CPU0 realm): 57MB/s
It nearly makes sense. Only the third case, 64G$128G, meaning the uppermost 64GB, also yields good results, which somewhat contradicts the theory.
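For anyone trying to reproduce this: the node topology (which CPUs and how much memory belong to each NUMA node) can be read from sysfs, and tools like taskset or numactl can pin a process to CPU0. Here is a small Python sketch, assuming the standard Linux sysfs layout (numactl --hardware reports the same information):

from pathlib import Path

# List each NUMA node with its CPU list and total memory (Linux sysfs).
for node in sorted(Path('/sys/devices/system/node').glob('node[0-9]*')):
    cpus = (node / 'cpulist').read_text().strip()
    mem = (node / 'meminfo').read_text().splitlines()[0].split()[-2:]
    print(f"{node.name}: CPUs {cpus}, MemTotal {' '.join(mem)}")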
Regards,
peter
Your CPU probably doesn't have enough cache to deal with it efficiently. Either use less memory, or get a CPU with a bigger cache.