PyTorch GPU memory management

In my code, I want to replace values in a tensor wherever the corresponding entries of another tensor are zero, for example
target_mac_out[avail_actions[:, 1:] == 0] = -9999999
But it raises an out-of-memory (OOM) error:
RuntimeError: CUDA out of memory. Tried to allocate 166.00 MiB (GPU 0; 10.76 GiB total capacity; 9.45 GiB already allocated; 4.75 MiB free; 9.71 GiB reserved in total by PyTorch)
I thought this would not allocate any memory, because it just visits target_mac_out, checks the values, and writes a new value at some indices.
Am I understanding this right?

It's hard to guess since we do not even know the sizes of the involved tensors, but your indexing expression avail_actions[:, 1:] == 0 creates a temporary boolean tensor, which does require memory allocation.

The expression avail_actions[:, 1:] == 0 creates a new tensor, and the whole line itself may create yet another temporary tensor before the old one is freed once the operation finishes.
If speed is not a problem, you can just use a for loop, like:
for i in range(target_mac_out.size(0)):
    for j in range(target_mac_out.size(1) - 1):
        # check the availability mask, not the target values themselves
        if avail_actions[i, j + 1] == 0:
            target_mac_out[i, j + 1] = -9999999
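If memory rather than speed is the real constraint, an in-place masked fill avoids some of the temporaries created by boolean-index assignment. A minimal sketch, assuming target_mac_out has the same shape as avail_actions[:, 1:] (the tensor names come from the question, but the shapes here are made up for illustration):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
target_mac_out = torch.randn(32, 10, device=device)
avail_actions = torch.randint(0, 2, (32, 11), device=device)

# fill in place; only the boolean mask is allocated as a temporary
mask = avail_actions[:, 1:] == 0
target_mac_out.masked_fill_(mask, -9999999)

# if even the mask is too large to hold at once, process the batch in chunks
# (split() returns views, so the in-place fill still updates target_mac_out)
for t_chunk, a_chunk in zip(target_mac_out.split(8), avail_actions.split(8)):
    t_chunk.masked_fill_(a_chunk[:, 1:] == 0, -9999999)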

Related

Why does it take so long to print the value of a GPU tensor in PyTorch?

I wrote this PyTorch program to compute a 5000*5000 matrix multiplication on the GPU, 100 iterations.
import torch
import numpy as np
import time
N = 5000
x1 = np.random.rand(N, N)
######## a 5000*5000 matrix multiplication on GPU, 100 iterations #######
x2 = torch.tensor(x1, dtype=torch.float32).to("cuda:0")
start_time = time.time()
for n in range(100):
    G2 = x2.t() @ x2
print(G2.size())
print("It takes", time.time() - start_time, "seconds to compute")
print("G2.device:", G2.device)
start_time2 = time.time()
# G4 = torch.zeros((5,5),device="cuda:0")
G4 = G2[:5, :5]
print("G4.device:", G4.device)
print("G4======", G4)
# G5=G4.cpu()
# print("G5.device:",G5.device)
print("It takes", time.time() - start_time2, "seconds to transfer or display")
Here is the result on my laptop:
torch.Size([5000, 5000])
It takes 0.22243595123291016 seconds to compute
G2.device: cuda:0
G4.device: cuda:0
G4====== tensor([[1636.3195, 1227.1913, 1252.6871, 1242.4584, 1235.8160],
[1227.1913, 1653.0522, 1260.2621, 1246.9526, 1250.2871],
[1252.6871, 1260.2621, 1685.1147, 1257.2373, 1266.2213],
[1242.4584, 1246.9526, 1257.2373, 1660.5951, 1239.5414],
[1235.8160, 1250.2871, 1266.2213, 1239.5414, 1670.0034]],
device='cuda:0')
It takes 60.13639569282532 seconds to transfer or display
Process finished with exit code 0
I am confused why it takes so much time to display the variable G4 on the GPU, since it is only 5*5 in size.
BTW, when I use "G5 = G4.cpu()" to transfer the variable from the GPU to the CPU, it also takes a long time.
My development environment (rather old laptop):
pytorch 1.0.0
CUDA 8.0
Nvidia GeForce GT 730m
Windows 10 Professional
When I increase the number of iterations, the compute time does not increase noticeably, but the transfer/display time does. Why? Can somebody explain? Thanks so much.
PyTorch CUDA operations are asynchronous. Most operations on GPU tensors are non-blocking: commands like a matrix multiply are queued and processed in parallel with your Python code until a derived result is requested (for example a CPU copy of a tensor). When you stop the timer there is no guarantee that the operation has completed. You can read more about this in the docs.
To time chunks of your code properly you should add calls to torch.cuda.synchronize(). This function should be called twice: once right before you start your timer, and once right before you stop it. Outside of profiling you should avoid calls to this function, as it may slow down overall performance.
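A minimal sketch of that timing pattern, reusing the matrix multiply from the question (the sizes and device string are just the ones from the post):

import time
import torch

x2 = torch.rand(5000, 5000, device="cuda:0")

torch.cuda.synchronize()      # wait for any pending GPU work before starting the clock
start = time.time()
for _ in range(100):
    G2 = x2.t() @ x2
torch.cuda.synchronize()      # wait for the queued matmuls to actually finish
print("compute:", time.time() - start, "seconds")

torch.cuda.Event objects (created with enable_timing=True) give finer-grained GPU-side timing, but the synchronize bracket above is usually enough for this kind of coarse profiling.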

How to calculate space for number of records

I am trying to calculate the space required by a dataset using the formula below, but I am going wrong somewhere when I cross-check it against an existing dataset in the system. Please help me.
1st Dataset:
Record format . . . : VB
Record length . . . : 445
Block size . . . . : 32760
Number of records....: 51560
Using the formula below to calculate:
optimal block length (OBL) = 32760/record length = 32760/449 = 73
As there are two blocks on the track, hence (TOBL) = 2 * OBL = 73*2 = 146
Find number of physical records (PR) = Number of records/TOBL = 51560/146 = 354
Number of tracks = PR/2 = 354/2 = 177
But I see the following in the dataset information:
Current Allocation
Allocated tracks . : 100
Allocated extents . : 1
Current Utilization
Used tracks . . . . : 100
Used extents . . . : 1
2nd Dataset :
Record format . . . : VB
Record length . . . : 445
Block size . . . . : 27998
Number of Records....: 127,252
Using the formula below to calculate:
optimal block length (OBL) = 27998/record length = 27998/449 = 63
As there are two blocks on the track, hence (TOBL) = 2 * OBL = 63*2 = 126
Find number of physical records (PR) = Number of records/TOBL = 127252/126 = 1010
Number of tracks = PR/2 = 1010/2 = 505
Number of Cylinders = 505/15 = 34
But I see the following in the dataset information:
Current Allocation
Allocated cylinders : 69
Allocated extents . : 1
Current Utilization
Used cylinders . . : 69
Used extents . . . : 1
A few observations on your approach.
First, since you're dealing with variable-length records, it would be helpful to know the "average" record length, as that would allow a more accurate prediction of storage. Your approach assumes a worst-case scenario of every record being at maximum length, which is fine for planning purposes, but in reality the actual allocation will likely be lower if the average record length is below the maximum.
The approach you are taking is reasonable, but consider that you can inform z/OS of the space requirements in blocks, records, or DASD geometry, or let DFSMS perform the calculation on your behalf. Refer to this article for additional information on the options.
Back to your calculations:
Your Optimum Block Length (OBL) is really a records-per-block (RPB) number. Block size divided by maximum record length yields the number of full-length records that can be stored in a block. If your average record length is less, you can store more records per block.
The assumption of two blocks per track may be true for your situation, but it depends on the actual device type that will be used for the underlying allocation. Here is a link to the geometries of supported DASD devices.
Your assumption of two blocks per track is not correct for 3390s with a 32,760-byte block size: you would need about 64k on a track for two such blocks, but as you can see, a 3390 track maxes out at about 56k, so you would only get one block per track on that device.
Also, it looks like you did factor in the RDW by adding 4 bytes, but someone looking at the question might be confused if they are not familiar with V records on z/OS. In the case of your calculation, that works out to 62 records per block at a block size of 27,998 (which is the "optimal" block length, since two such blocks fit comfortably on a track).
I'll use the following values:
MaximumRecordLength = RecordLength + 4 for RDW
TotalRecords = Total Records at Maximum Length (worst case)
BlockSize = modeled blocksize
RecordsPerBlock = number of records that can fit in a block (worst case)
BlocksNeeded = number of blocks needed to contain estimated records (worst case)
BlocksPerTrack = from IBM device geometry information
TracksNeeded = TotalRecords / RecordsPerBlock / BlocksPerTrack
Cylinders = TracksNeeded / TracksPerCylinder (15 tracks per cylinder for most devices)
Example 1:
Total Records = 51,560
BlockSize = 32,760
BlocksPerTrack = 1 (from device table)
RecordsPerBlock: 32,760 / 449 = 72.96 (72)
Total Blocks = 51,560 / 72 = 716.11 (717)
Total Tracks = 717 * 1 = 717
Cylinders = 717 / 15 = 47.8 (48)
Example 2:
Total Records = 127,252
BlockSize = 27,998
BlocksPerTrack = 2 (from device table)
RecordsPerBlock: 27,998 / 449 = 62.35 (62)
Total Blocks = 127,252 / 62 = 2052.45 (2,053)
Total Tracks = 2,053 / 2 = 1,026.5 (1,027)
Cylinders = 1027 / 15 = 68.5 (69)
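For reference, here is a small Python sketch of the same worst-case arithmetic; the geometry values (blocks per track, 15 tracks per cylinder) are the ones assumed above, not looked up from a device table:

import math

def estimate_space(total_records, max_lrecl, block_size, blocks_per_track, tracks_per_cylinder=15):
    # worst case: every record at maximum length, plus 4 bytes for the RDW
    record_len = max_lrecl + 4
    records_per_block = block_size // record_len
    blocks_needed = math.ceil(total_records / records_per_block)
    tracks_needed = math.ceil(blocks_needed / blocks_per_track)
    cylinders = math.ceil(tracks_needed / tracks_per_cylinder)
    return records_per_block, blocks_needed, tracks_needed, cylinders

# Example 1: 32,760-byte blocks, one block per 3390 track
print(estimate_space(51560, 445, 32760, blocks_per_track=1))    # (72, 717, 717, 48)
# Example 2: 27,998-byte blocks, half-track blocking, two blocks per track
print(estimate_space(127252, 445, 27998, blocks_per_track=2))   # (62, 2053, 1027, 69)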
Now, as to the actual allocation: it depends on how you allocated the space and on the size of the records. Assuming it was in JCL, you could use the RLSE subparameter of SPACE= to release unused space when the dataset is created and closed.
Given that the records are variable length, the estimates are worst case; you would need to know more about the average record length to understand the actual space used.
Final thought: all of the work you're doing can be overridden by your storage administrator through ACS routines. I believe most people today would specify BLKSIZE=0 and let DFSMS do the hard work, because that component has more information about where the file will go, what the underlying devices are, and the most efficient way of doing the allocation. The days of hand-calculating disk geometry and allocation are more of a campfire story, unless your environment has not been administered to do these things for you.
Instead of trying to calculate tracks or cylinders, go for MBs or KBs; z/OS (DFSMS) will calculate how many tracks or cylinders are needed for you.
In JCL it is not straightforward, but also not too complicated once you get it.
There is a DD statement parameter called AVGREC=, which is the trigger. Let me do an example for your first case above:
//anydd DD DISP=(NEW,CATLG),
// DSN=your.new.data.set.name,
// RECFM=VB,LRECL=445,
// SPACE=(445,(51560,1000)),AVGREC=U
//* | | | |
//* V V V V
//* (1) (2) (3) (4)
Parameter AVGREC=U (4) tells the system three things:
Firstly, the first subparameter in SPACE= (1) shall be interpreted as an average record length. (Note that this value is completely independent of the value specified in LRECL=.)
Secondly, it tells the system that the second (2) and third (3) SPACE= subparameters are the number of records of average length (1) that the data set shall be able to store.
Thirdly, it tells the system that the numbers (2) and (3) are in single records (AVGREC=U). Alternatives are thousands of records (AVGREC=K) and millions of records (AVGREC=M).
So, this DD statement will allocate enough space to hold the estimated number of records. You don't have to care for track capacity, block capacity, device geometry, etc.
Given the number of records you expect and the (average) record length, you can easily calculate the number of kilobytes or megabytes you need. Unfortunately, you cannot directly specify KB, or MB in JCL, but there is a way using AVGREC= as follows.
Your first data set will get 51560 records of (maximum) length 445, i.e. 22'944'200 bytes, or ~22'945 KB, or ~23 MB. The JCL for an allocation in KB looks like this:
//anydd DD DISP=(NEW,CATLG),
// DSN=your.new.data.set.name,
// RECFM=VB,LRECL=445,
// SPACE=(1,(22945,10000)),AVGREC=K
//* | | | |
//* V V V V
//* (1) (2) (3) (4)
You want the system to allocate primary space for 22945 (2) thousands (4) records of length 1 byte (1), which is 22945 KB, and secondary space for 10'000 (3) thousands (4) records of length 1 byte (1), i.e. 10'000 KB.
Now the same allocation specifying MB:
//anydd DD DISP=(NEW,CATLG),
// DSN=your.new.data.set.name,
// RECFM=VB,LRECL=445,
// SPACE=(1,(23,10)),AVGREC=M
//* | | | |
//* V V V V
//* (1) (2)(3) (4)
You want the system to allocate primary space for 23 (2) millions (4) records of length 1 byte (1), which is 23 MB, and secondary space for 10 (3) millions (4) records of length 1 byte (1), i.e. 10 MB.
I rarely use anything other than the latter.
In ISPF, it is even easier: Data Set Allocation (3.2) allows KB, and MB as space units (amongst all the old ones).
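For completeness, a quick Python sketch of the byte arithmetic behind the SPACE= values in the two examples above:

records = 51560
max_lrecl = 445

total_bytes = records * max_lrecl
print(total_bytes)           # 22944200 bytes
print(total_bytes / 1000)    # ~22945 KB -> SPACE=(1,(22945,10000)),AVGREC=K
print(total_bytes / 10**6)   # ~23 MB   -> SPACE=(1,(23,10)),AVGREC=M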
A useful and usually simpler alternative to using SPACE and AVGREC etc. is to simply use a DATACLAS for space, if your site has appropriately sized ones defined. If you look at ISMF Option 4 you can list the available DATACLASes and see what space values etc. they provide. You would expect to see a number of size ranges, some with or without Extended Format and/or Compression. Even if a DATACLAS over-allocates a bit, the over-allocated space will likely be released by the MGMTCLAS assigned to the dataset at close or during space management. You also have the option to code both DATACLAS and SPACE, in which case any coded space (or other) value will override the DATACLAS, which helps with exceptions. It still depends on how your Storage Admins have coded the ACS routines, but generally users are allowed to specify a DATACLAS and it will be honored by the ACS routines.
For a basic dataset size calculation I just take LRECL times the expected maximum record count, divided by 1000 a couple of times, to get a rough MB figure. Obviously variable records/blocks add 4 bytes each for the RDW and/or BDW, but unless the number of records is massive or DASD space is extremely tight, it shouldn't be significant enough to matter.
e.g.
=(51560*445)/1000/1000 shows as ~23MB
Also, don't expect your allocation to be exactly what you requested, because the minimum allocation on z/OS is 1 track, or ~56k. The BLKSIZE also comes into play by adding inter-block gaps of ~32 bytes per block. With SDB (System Determined Blocksize), invoked by omitting BLKSIZE or coding BLKSIZE=0, the system will always try to provide half-track blocking as close to 28k as possible, i.e. two blocks per track, which is the most space-efficient. That does matter: a BLKSIZE of 80 bytes wastes ~80% of a track with inter-block gaps. The BLKSIZE is also the unit of transfer when reading from or writing to disk, so generally the larger the better, with some exceptions such as a KSDS being randomly accessed by key, where a large block might result in more data transfer than desired in an OLTP transaction.

Memory consumption in Python - lists, subscripting, and pointers

I'm trying to understand how much memory Python objects use.
In the following code, I check the memory of a numpy array vs. a list, as well as of a subscripted numpy array:
import sys, os, psutil, numpy as np

def size_of(obj):
    return f'{sys.getsizeof(obj) / 1000000:,.0f} MB'

def get_memory_usage():
    process = psutil.Process(os.getpid())
    return f'{process.memory_info().rss / 1000000:,.0f} MB'
# Numpy vs List
print(f'(1) Mem usage: {get_memory_usage()}')
ONE_HUNDRED_MIL_NP = np.random.randint(-128,127,int(10**8),dtype='int8')
print(f'(2) Mem usage: {get_memory_usage()}, ONE_HUNDRED_MIL_NP: {size_of(ONE_HUNDRED_MIL_NP)}')
ONE_HUNDRED_MIL_LIST = list(np.random.choice(127, int(10**8), replace=True).astype('int8'))
print(f'(3) Mem usage: {get_memory_usage()}, ONE_HUNDRED_MIL_LIST: {size_of(ONE_HUNDRED_MIL_LIST)}')
# Now try subscripting
FOURCOLS = np.random.randint(-128,127,size=(int(10**8),4),dtype='int8')
print(f'(4) Mem usage: {get_memory_usage()}, FOURCOLS: {size_of(FOURCOLS)}')
FOURCOLS_PERMUTED = FOURCOLS[np.random.randint(0,len(FOURCOLS),size=len(FOURCOLS),dtype='int32')]
print(f'(5) Mem usage: {get_memory_usage()}, FOURCOLS_PERMUTED: {size_of(FOURCOLS_PERMUTED)}')
This returns:
(1) Mem usage: 187 MB
(2) Mem usage: 287 MB, ONE_HUNDRED_MIL_NP: 100 MB
(3) Mem usage: 3,526 MB, ONE_HUNDRED_MIL_LIST: 900 MB
(4) Mem usage: 3,926 MB, FOURCOLS: 400 MB
(5) Mem usage: 4,326 MB, FOURCOLS_PERMUTED: 400 MB
Notes:
Output (2) makes sense. One int8 is 8 bits (one byte) and 100 million bytes is 100 MB
Output (3) I don't understand:
The first issue is that sys.getsizeof() shows the object takes up 900 MB, but psutil shows that the process now takes up 3,239 MB more memory (3526-287=3239). Where is this phantom memory usage coming from?
Where does the 900 MB come from? (From Python: Size of Reference?, I'm assuming that there's 100 MB of the numpy object plus 100 million pointers, which are 8 bytes each, so 100 MB + 800 MB = 900 MB?)
Output (4) Makes sense. 400 million int8s is 400 MB.
Output (5) I don't understand. Is a copy being made or references? If references, we're only referencing 100 million rows, right? How does this make 400 MB?
Thanks
ONE_HUNDRED_MIL_NP = np.random.randint(-128,127,int(10**8),dtype='int8')
This makes an array. ONE_HUNDRED_MIL_NP.nbytes is a good measure of the array size. An array has some basic info like shape, strides, and dtype, but the bulk of the space is a 1d data buffer that contains the bytes, in this case one byte per element.
ONE_HUNDRED_MIL_LIST = list(np.random.choice(127, int(10**8), replace=True).astype('int8'))
This produces a list from that array. A list has a data buffer that contains references to objects elsewhere in memory. getsizeof just measures the size of that buffer and says nothing about the objects themselves. Here those objects are numpy.int8 objects, extracted from the array. They don't actually reference elements of the array; rather, they are copies of those values.
A better way to get a list from an array is with arr.tolist().
FOURCOLS = np.random.randint(-128,127,size=(int(10**8),4),dtype='int8')
This is just another array. The 2d shape doesn't change how much memory it takes up.
FOURCOLS_PERMUTED = FOURCOLS[np.random.randint(0,len(FOURCOLS),size=len(FOURCOLS),dtype='int32')]
This is an example of advanced indexing. It creates a new array with its own data buffer (not a view with a shared buffer). Yes, you are just indexing the first dimension, but the data buffer stores all values, not references to rows of FOURCOLS.
A list of lists does store references (pointers) to the nested lists, so shuffling the outer list would just shuffle the references. Multidimensional C arrays built as arrays of pointers also store pointers. But multidimensional numpy arrays use a different model: the data is a flat C array, and multidimensionality is produced by the shape/strides iteration code.
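A small sketch of the view-vs-copy difference, checking whether the result shares the original buffer (the array names here are made up for illustration):

import numpy as np

a = np.zeros((3, 4), dtype='int8')
v = a[1:]                      # basic slicing: a view that shares a's data buffer
c = a[[2, 0, 1]]               # advanced indexing: a new array with its own buffer
print(v.base is a)             # True
print(c.base is a)             # False
print(c.nbytes == a.nbytes)    # True: all values are copied, not row pointers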
So looking at your numbers:
(1) Mem usage: 187 MB
base usage.
(2) Mem usage: 287 MB, ONE_HUNDRED_MIL_NP: 100 MB
adds 100 MB to the base.
(3) Mem usage: 3,526 MB, ONE_HUNDRED_MIL_LIST: 900 MB
The 900 is roughly the memory used by the list's data buffer. The rest of total usage increase is storage for those 10**8 np.int8 objects.
(4) Mem usage: 3,926 MB, FOURCOLS: 400 MB
This shows another 400 MB memory usage.
(5) Mem usage: 4,326 MB, FOURCOLS_PERMUTED: 400 MB
And yet another 400.
Without the list creation in (3), memory usage would show an orderly increase matching each array's size.
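A small sketch illustrating the measurements discussed above, shrunk to 10**6 elements so it runs quickly (exact byte counts will vary by platform and numpy version):

import sys
import numpy as np

arr = np.random.randint(-128, 127, int(10**6), dtype='int8')
print(arr.nbytes)                 # 1000000: the flat data buffer
print(sys.getsizeof(arr))         # buffer plus a small array header

as_list = list(arr)               # a list of np.int8 scalar objects (copies of the values)
print(sys.getsizeof(as_list))     # only the list's pointer buffer, ~8 bytes per entry
print(sys.getsizeof(as_list[0]))  # each np.int8 object adds its own overhead on top

py_list = arr.tolist()            # plain Python ints; usually the leaner way to get a list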

saving and loading large numpy matrix

The code below is how I save the numpy array, and the file is about 27 GB after saving. There are more than 200K images, and each has shape (224,224,3).
hf = h5py.File('cropped data/features_train.h5', 'w')
for i, each in enumerate(features_train):
    hf.create_dataset(str(i), data=each)
hf.close()
This is the method I used to load the data, and it takes hours to load.
features_train = np.zeros(shape=(1,224,224,3))
hf = h5py.File('cropped data/features_train.h5', 'r')
for key in hf.keys():
    x = hf.get(key)
    x = np.array(x)
    features_train = np.append(features_train, np.array([x]), axis=0)
hf.close()
So, does anyone have a better solution for loading this much data?
You didn't tell us how much physical RAM your server has, but 27 GiB sounds like "a lot". Consider breaking your run into several smaller batches.
There is an old saw in java land that asks "why does this have quadratic runtime?", that is, "why is this so slow?"
String s = ""
for (int i = 0; i < 1e6, i++) {
s += "x";
}
The answer is that toward the end, each iteration reads roughly a million characters, writes them back, and then appends a single character. The total cost is O(1e12). The standard solution is to use a StringBuilder, which gets us back to the expected O(1e6).
Here, I worry that calling np.append() pushes us into the quadratic regime. To verify, replace the features_train assignment with a simple evaluation of np.array([x]), so we spend a moment computing and then immediately discard that value on each iteration. If the conjecture is right, the runtime will be much smaller.
To remedy it, avoid calling np.append(). Rather, preallocate the 27 GiB with np.zeros() (or np.empty()), and then within the loop assign each freshly read array into its preallocated slot, as in the sketch below. Linear runtime will allow the task to complete much more quickly.
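A minimal sketch of that preallocation pattern, assuming the keys were written as the stringified indices 0..N-1 exactly as in the saving code above (the path and shapes come from the question; the dtype is inferred from the file):

import h5py
import numpy as np

with h5py.File('cropped data/features_train.h5', 'r') as hf:
    n = len(hf)                       # number of saved images
    first = hf['0']
    # allocate the full array once instead of growing it with np.append
    features_train = np.empty((n, *first.shape), dtype=first.dtype)
    for i in range(n):
        features_train[i] = hf[str(i)][()]   # read each image straight into its slot

Saving all the images as a single (N, 224, 224, 3) dataset in the first place would be simpler still, since it could then be read back, in whole or in slices, with ordinary array indexing.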

RuntimeError: size mismatch, m1: [4 x 3136], m2: [64 x 5] at c:\a\w\1\s\tmp_conda_3.7_1

I used Python 3, and when I insert a random crop transform of size 224 it gives a size mismatch error.
Here is my code.
What did I do wrong?
Your code makes variations on ResNet: you changed the number of channels and the number of bottlenecks at each "level", and you removed a "level" entirely. As a result, the feature map at the end of layer3 has a larger spatial dimension than the nn.AvgPool2d(8) anticipates. The error message actually tells you this: the output of layer3 has shape 64x56x56, and after average pooling with kernel and stride 8 you get a 64x7x7 = 3136-dimensional feature vector, instead of the 64 values you are expecting.
What can you do?
As opposed to "standard" resnet, you removed stride from conv1 and you do not have max pool after conv1. Moreover, you removed layer4 which also have a stride. Therefore, You can add pooling to your net to reduce the spatial dimensions of layer3.
Alternatively, you can replace nn.AvgPool(8) with nn.AdaptiveAvgPool2d([1, 1]) an avg pool that outputs only one feature regardless of the spatial dimensions of the input feature map.
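A small sketch of the adaptive-pooling option; the 64 channels and the 56x56 spatial size are taken from the error message, and the final nn.Linear(64, 5) matches m2: [64 x 5]:

import torch
import torch.nn as nn

x = torch.randn(4, 64, 56, 56)        # stand-in for the output of layer3
pool = nn.AdaptiveAvgPool2d((1, 1))   # always pools down to 1x1, whatever the input size
fc = nn.Linear(64, 5)                 # 5 output classes

out = fc(pool(x).flatten(1))          # pooled to [4, 64], then projected to [4, 5]
print(out.shape)                      # torch.Size([4, 5])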
