I used fewer than 7,000 examples to pretrain BART, and I am using eight 3080 graphics cards, so I should have enough memory; however, it reports out of memory. Here is some of the code:
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
pretrain_bart.py \
--num-layers 12 \
--hidden-size 1024 \
--num-attention-heads 16 \
--micro-batch-size 1 \
--global-batch-size 8 \
As you can see, my batch size is already very small. Here is the error report:
RuntimeError: CUDA out of memory. Tried to allocate 42.00 MiB (GPU 2; 9.78 GiB total capacity; 7.63 GiB already allocated; 4.56 MiB free; 7.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This error occurs 8 times, once per GPU. I keep feeling that I am still training with one GPU instead of eight together, so how do I solve this problem?
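One way to check the suspicion that only one GPU is actually being used (a minimal sketch of mine, not part of the original script; it assumes pretrain_bart.py initializes the process group, as Megatron-style scripts normally do) is to print each process's rank and the world size once initialization is done:
import os
import torch
import torch.distributed as dist

def report_distributed_setup():
    # Hypothetical helper; call it early in pretrain_bart.py.
    # torch.distributed.launch starts one process per GPU; depending on the
    # PyTorch version the local rank arrives via --local_rank or LOCAL_RANK.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} on cuda:{local_rank}")
If all eight processes print ranks 0-7 with a world size of 8, distributed training is working and the OOM is a per-GPU memory problem (with pure data parallelism each GPU still holds a full copy of the model, so adding GPUs does not reduce the memory needed per card); if the world size comes back as 1, the launcher arguments are not reaching the script.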
According to this documentation, CPU shares are calculated as
process_cpu.shares = min(1024 * (application_memory / 8 GB), 1024)
According to this formula, if an application is assigned 1 GB of memory, it should get 128 CPU shares (1024 * (1/8)). However, if we SSH into the application and check cpu.shares, we get 122:
cat /sys/fs/cgroup/cpu/cpu.shares
122
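To make the expected values explicit, here is a small sketch (mine, not from the documentation) that simply evaluates the published formula for the memory limits in the table below:
def expected_shares(memory_gb):
    # process_cpu.shares = min(1024 * (application_memory / 8 GB), 1024)
    return min(int(1024 * memory_gb / 8), 1024)

for mem_gb in (1, 1.5, 2, 3, 4, 5, 5.5, 8):
    print(mem_gb, "GB ->", expected_shares(mem_gb))
# 1 GB -> 128, 1.5 GB -> 192, 2 GB -> 256, 3 GB -> 384,
# 4 GB -> 512, 5 GB -> 640, 5.5 GB -> 704, 8 GB -> 1024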
Here are the observations:
Memory          Calculated cpu.share    Observed cpu.share    Difference
1GB             128                     122                   ~5% difference
1.5GB/1536MB    192                     184                   ~5% difference
2GB             256                     256
3GB             384                     384
4GB             512                     512
5GB             640                     634                   ~1% difference
5.5GB/5632MB    704                     696                   ~2% difference
8GB             1024                    1024
Why is there this discrepancy for some values (such as 1 GB, 1.5 GB, 5 GB, etc.) while others (like 2, 3, 4 and 6, 7, 8) are consistent with the calculation? I believe I am missing something about how cgroups handle this calculation. Is this specific to CF, or is it something to do with the way Linux calculates resource allocation in cgroups in general? Is there always some headroom reserved?
Today I upgraded my account to Colab Pro. Although it prints the RAM as:
Your runtime has 27.3 gigabytes of available RAM
You are using a high-RAM runtime!
when I start training my model, it gives the error below.
RuntimeError: CUDA out of memory. Tried to allocate 88.00 MiB (GPU 0; 15.90 GiB total capacity; 14.75 GiB already allocated; 75.75 MiB free; 14.95 GiB reserved in total by PyTorch)
Hyperparameters of my model:
args_dict = dict(
    # data_dir="", # path for data files
    output_dir="", # path to save the checkpoints
    model_name_or_path='t5-large',
    tokenizer_name_or_path='t5-large',
    max_seq_length=600,
    learning_rate=3e-4,
    weight_decay=0.0,
    adam_epsilon=1e-8,
    warmup_steps=0,
    train_batch_size=4,
    eval_batch_size=4,
    num_train_epochs=2,
    gradient_accumulation_steps=16,
    n_gpu=1,
    early_stop_callback=False,
    fp_16=True, # if you want to enable 16-bit training then install apex and set this to true
    opt_level='O1', # you can find out more on optimisation levels here https://nvidia.github.io/apex/amp.html#opt-levels-and-properties
    max_grad_norm=1.0, # if you enable 16-bit training then set this to a sensible value, 0.5 is a good default
    seed=42,
)
Colab Pro is not providing all of the RAM. My code only works if train_batch_size = 1. What causes this? Any ideas?
Note: I get the same error when I run the code on Kaggle (16 GB). So what do I get with Colab Pro?
Looking at your error, the 16 GB refers to the graphics card, not the RAM.
As far as I know, using Colab Pro enables you to use a graphics card with up to 16 GB of VRAM.
You can check the VRAM amount by running the following code.
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
    print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
    print('and then re-execute this cell.')
else:
    print(gpu_info)
Maybe use a smaller batch size than 4?
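For example (a sketch only, reusing the args_dict from the question; I have not run it on your setup), you can shrink the per-step batch and compensate with more gradient accumulation so the effective batch size stays the same:
# Hypothetical adjustment: 1 example per step with 64 accumulation steps
# gives the same effective batch (64) as the original 4 * 16, while peak
# activation memory per step is roughly a quarter of before.
args_dict.update(
    train_batch_size=1,
    eval_batch_size=1,
    gradient_accumulation_steps=64,
)
Lowering max_seq_length is another lever, since activation memory grows with the sequence length.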
Why are file sizes all different?
In Windows 10 I can see all of these sizes:
11,116 KB
10.8 MB
11,382,240 Bytes
11,382,784 Bytes
If I use the Console Window:
D:\My Programs\2017\MeetSchedAssist\Inno\Output>dir *.exe
Volume in drive D is DATA
Volume Serial Number is A8B0-A5C6
Directory of D:\My Programs\2017\MeetSchedAssist\Inno\Output
03/04/2018 08:50 11,382,240 MeetSchedAssistSetup.exe
1 File(s) 11,382,240 bytes
0 Dir(s) 719,837,487,104 bytes free
D:\My Programs\2017\MeetSchedAssist\Inno\Output>
I understand that perhaps on the physical media it has to be rounded to take up a certain amount of space, but look at this line:
Size: 10.8 MB (11,382,240 bytes)
Huh? Why does it not say 11.38 MB?
Once upon a time, it was defined that
1 kB = 1024 B
1 MB = 1024 kB
If you divide your bytes figure all the way down to MB, you'll get all those figures.
Since it was noticed that many people tend to walk into that trap, the unit multiples have been redefined and new ones introduced:
1 kiB = 1024 B
1 MiB = 1024 kiB
1 kB = 1000 B
1 MB = 1000 kB
but this scheme is not so widespread (it seems to be more common for the total size specs of storage media).
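You can reproduce both figures from the byte count directly, e.g. with a couple of lines of Python:
size_bytes = 11_382_240
print(size_bytes / 1024 / 1024)  # ~10.85 -> what Windows rounds to "10.8 MB" (1024-based)
print(size_bytes / 1000 / 1000)  # 11.38224 -> the "11.38 MB" you expected (1000-based)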
Funny sidenote: I guess I am not the only one who has learned it the old way and now mixes it up with the current definition all the time. I'd say problems like this are the root cause for humanity being mostly conservatively oriented.
I'm trying to make sense of the GHC profiler. There is a rather simple app, which uses the wreq and lens-aeson libraries, and while learning about GHC profiling, I decided to play with it a bit.
Using different options (the time tool, +RTS -p -RTS, and +RTS -p -h), I acquired entirely different numbers for my memory usage. Having all those numbers, I'm now completely lost trying to understand what is going on and how much memory the app actually uses.
This situation reminds me of the phrase by Arthur Bloch: "A man with a watch knows what time it is. A man with two watches is never sure."
Can you please suggest how I should read all those numbers, and what the meaning of each of them is?
Here are the numbers:
time -l reports around 19M
#/usr/bin/time -l ./simple-wreq
...
3.02 real 0.39 user 0.17 sys
19070976 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
21040 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
71 messages sent
71 messages received
2991 signals received
43 voluntary context switches
6490 involuntary context switches
Using the +RTS -p -RTS flag reports around 92M. Although it says "total alloc", it seems strange to me that a simple app like this one could allocate and release 91M:
# ./simple-wreq +RTS -p -RTS
# cat simple-wreq.prof
Fri Oct 14 15:08 2016 Time and Allocation Profiling Report (Final)
simple-wreq +RTS -N -p -RTS
total time = 0.07 secs (69 ticks @ 1000 us, 1 processor)
total alloc = 91,905,888 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
main.g Main 60.9 88.8
MAIN MAIN 24.6 2.5
decodeLenient/look Data.ByteString.Base64.Internal 5.8 2.6
decodeLenientWithTable/fill Data.ByteString.Base64.Internal 2.9 0.1
decodeLenientWithTable.\.\.fill Data.ByteString.Base64.Internal 1.4 0.0
decodeLenientWithTable.\.\.fill.\ Data.ByteString.Base64.Internal 1.4 0.1
decodeLenientWithTable.\.\.fill.\.\.\.\ Data.ByteString.Base64.Internal 1.4 3.3
decodeLenient Data.ByteString.Base64.Lazy 1.4 1.4
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 443 0 24.6 2.5 100.0 100.0
main Main 887 0 0.0 0.0 75.4 97.4
main.g Main 889 0 60.9 88.8 75.4 97.4
object_ Data.Aeson.Parser.Internal 925 0 0.0 0.0 0.0 0.2
jstring_ Data.Aeson.Parser.Internal 927 50 0.0 0.2 0.0 0.2
unstream/resize Data.Text.Internal.Fusion 923 600 0.0 0.3 0.0 0.3
decodeLenient Data.ByteString.Base64.Lazy 891 0 1.4 1.4 14.5 8.1
decodeLenient Data.ByteString.Base64 897 500 0.0 0.0 13.0 6.7
....
+RTS -p -h and hp2ps show me the following picture and two numbers: 114K in the header and something around 1.8 MB on the graph.
And, just in case, here is the app:
module Main where
import Network.Wreq
import Control.Lens
import Data.Aeson.Lens
import Control.Monad
main :: IO ()
main = replicateM_ 10 g
where
g = do
r <- get "http://httpbin.org/get"
print $ r ^. responseBody
. key "headers"
. key "User-Agent"
. _String
UPDATE 1: Thanks everyone for the incredibly good responses. As was suggested, I am adding the +RTS -s output, so the entire picture builds up for everyone who reads it.
#./simple-wreq +RTS -s
...
128,875,432 bytes allocated in the heap
32,414,616 bytes copied during GC
2,394,888 bytes maximum residency (16 sample(s))
355,192 bytes maximum slop
7 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 194 colls, 0 par 0.018s 0.022s 0.0001s 0.0022s
Gen 1 16 colls, 0 par 0.027s 0.031s 0.0019s 0.0042s
UPDATE 2: The size of the executable:
#du -h simple-wreq
63M simple-wreq
A man with a watch knows what time it is. A man with two watches is never sure.
Ah, but what do the two watches show? Are both meant to show the current time in UTC? Or is one of them supposed to show the time in UTC, and the other one the time at a certain point on Mars? As long as they are in sync, the second scenario wouldn't be a problem, right?
And that is exactly what is happening here. You compare different memory measurements:
the maximum residency
the total amount of allocated memory
The maximum residency is the highest amount of memory your program ever uses at a given time. That's 19MB. However, the total amount of allocated memory is a lot more, since that's how GHC works: it "allocates" memory for objects that are garbage collected, which is almost everything that's not unpacked.
Let us inspect a C example for this:
#include <stdlib.h>

int main() {
    int i;
    char *mem;
    for (i = 0; i < 5; ++i) {
        mem = malloc(19 * 1000 * 1000);  /* allocate ~19 MB ...           */
        free(mem);                       /* ... and release it right away */
    }
    return 0;
}
Whenever we use malloc, we will allocate 19 megabytes of memory. However, we free the memory immediately after. The highest amount of memory we ever have at one point is therefore 19 megabytes (and a little bit more for the stack and the program itself).
However, in total, we allocate 5 * 19M, 95M total. Still, we could run our little program with just 20 megs of RAM fine. That's the difference between total allocated memory and maximum residency. Note that the residency reported by time is always at least du <executable>, since that has to reside in memory too.
That being said, the easiest way to generate statistics is -s, which will show what the maximum residency was from the Haskell program's point of view. In your case, it will be the 1.9M, the number in your heap profile (or double that amount due to profiling). And yeah, Haskell executables tend to get extremely large, since libraries are statically linked.
time -l is displaying the (resident, i.e. not swapped out) size of the process as seen by the operating system (obviously). This includes twice the maximum size of the Haskell heap (due to the way GHC's GC works), plus anything else allocated by the RTS or other C libraries, plus the code of your executable itself plus the libraries it depends on, etc. I'm guessing that in this case the primary contributor to the 19M is the size of your executable.
total alloc is the total amount allocated onto the Haskell heap. It is not at all a measure of maximum heap size (which is what people usually mean by "how much memory is my program using"). Allocation is very cheap and allocation rates of around 1GB/s are typical for a Haskell program.
The number in the header of the hp2ps output "114,272 bytes x seconds" is something completely different again: it is the integral of the graph, and is measured in bytes * seconds, not in bytes. For example if your program holds onto a 10 MB structure for 4 seconds then that will cause this number to increase by 40 MB*s.
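To make the unit concrete, here is that example worked out (my own arithmetic, not additional profiler output):
held_bytes = 10 * 1000 * 1000     # a 10 MB structure ...
held_seconds = 4                  # ... held onto for 4 seconds
print(held_bytes * held_seconds)  # 40,000,000 byte*seconds, i.e. 40 MB*s
The 114,272 byte*seconds in your header is the same kind of quantity, accumulated over the whole run.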
The number around 1.8 MB shown in the graph is the actual maximum size of the Haskell heap, which is probably the number you're most interested in.
You've omitted the most useful source of numbers about your program's execution, which is running it with +RTS -s (this doesn't even require it to have been built with profiling).
On inspecting a crash dump file for an out-of-memory exception reported by a client, the results of !DumpHeap -stat showed that 575 MB of memory is being taken up by 45,000 objects of type "Free", most of which I assume would have to reside in Gen 2 due to their size.
The first places I looked for problems were the large object heap (LOH) and pinned objects. The large object heap, with free space included, was only 70 MB, so that wasn't the issue, and running !gchandles showed:
GC Handle Statistics:
Strong Handles: 155
Pinned Handles: 265
Async Pinned Handles: 8
Ref Count Handles: 163
Weak Long Handles: 0
Weak Short Handles: 0
Other Handles: 0
which is a very small number of handles (around 600) compared to the number of free objects (45,000). To me this rules out the free blocks being caused by pinning.
I also looked into the free blocks themselves to see if maybe they had a consistent size, but on inspection the sizes varied widely and went from just short of 5MB to only around 12 bytes or so.
Any help would be appreciated! I am at a loss, since there is fragmentation but no sign of it being caused by the two places I know to look: the large object heap (LOH) and pinned handles.
In which generation are the free objects?
I assume would have to reside in Gen 2 due to the size
Size is not related to generations. To find out in which generation the free blocks reside, you can follow these steps:
From !dumpheap -stat -type Free get the method table:
0:003> !dumpheap -stat -type Free
total 7 objects
Statistics:
MT Count TotalSize Class Name
00723538 7 100 Free
Total 7 objects
From !eeheap -gc, get the start addresses of the generations.
0:003> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x026a1018
generation 1 starts at 0x026a100c
generation 2 starts at 0x026a1000
ephemeral segment allocation context: none
segment begin allocated size
026a0000 026a1000 02731ff4 0x00090ff4(593908)
Large object heap starts at 0x036a1000
segment begin allocated size
036a0000 036a1000 036a3250 0x00002250(8784)
Total Size 0x93244(602692)
------------------------------
GC Heap Size 0x93244(602692)
Then dump only the free objects of a specific generation by passing the start and end addresses (e.g. of generation 2 here):
0:003> !dumpheap -mt 00723538 0x026a1000 0x026a100c
Address MT Size
026a1000 00723538 12 Free
026a100c 00723538 12 Free
total 2 objects
Statistics:
MT Count TotalSize Class Name
00723538 2 24 Free
Total 2 objects
So in my simple case, there are 2 free objects in generation 2.
Pinned handles
600 pinned objects should not cause 45,000 free memory blocks. Still, from my experience, 600 pinned handles is a lot. But first, check in which generation the free memory blocks reside.