Facing CUDA Error while training MNMT model using fairseq - pytorch

I was trying to use fairseq to train a model on English-Russian, English-French, English-Spanish, and English-German data, but I keep getting a CUDA error that prevents the model from running.
I have tried multiple batch sizes and learning rates, but am unable to run it.
fairseq-train pre \
--arch transformer_wmt_en_de \
--task translation_multi_simple_epoch \
--encoder-langtok src --decoder-langtok --lang-pairs en-ru,en-fr,en-es,en-de \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --lr 1e-03 --warmup-updates 4000 --max-update 100000 \
--dropout 0.3 --weight-decay 0.0001 \
--max-tokens 4096 --max-epoch 20 --update-freq 8 \
--save-interval 10 --save-interval-updates 5000 --keep-interval-updates 20 \
--log-format simple --log-interval 100 \
--save-dir checkpoints --validate-interval-updates 5000 \
--fp16 --num-workers 0 --batch-size 64
The above command is what I have used, with various different values for batch size, learning rate, etc., but every run ends in the same CUDA error.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 10.57 GiB (GPU 0; 15.74 GiB total capacity; 5.29 GiB already allocated; 9.50 GiB free; 5.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Any kind of help would be appreciated.

The preferred way of specifying the batch size in fairseq is via the --max-tokens argument, not --batch-size (I am not sure what happens if you specify both).
All sentences in a batch are padded to the same length, and sentence lengths can differ vastly: a single very long sentence makes the entire batch very large. The --max-tokens argument was introduced to avoid this. You have it set to 4096, meaning a batch will not exceed 4096 tokens, though the number of sentences per batch may vary. It is implemented efficiently by first sorting the training sentences by length, splitting them into batches, and then shuffling the batches randomly, which maximizes memory efficiency.
What you should do is:
Remove the --batch-size argument.
Try to decrease the --max-tokens argument.
If it still does not help, use a smaller model.
The learning rate has no effect on memory consumption.
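For concreteness, a minimal adjustment of your command along those lines could look like the sketch below: --batch-size is dropped and --max-tokens is halved to 2048 (an example starting value, not a recommendation; keep lowering it until the OOM disappears), with every other flag left as in your original command.
fairseq-train pre \
--arch transformer_wmt_en_de \
--task translation_multi_simple_epoch \
--encoder-langtok src --decoder-langtok --lang-pairs en-ru,en-fr,en-es,en-de \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' \
--lr-scheduler inverse_sqrt --lr 1e-03 --warmup-updates 4000 --max-update 100000 \
--dropout 0.3 --weight-decay 0.0001 \
--max-tokens 2048 --max-epoch 20 --update-freq 8 \
--save-interval 10 --save-interval-updates 5000 --keep-interval-updates 20 \
--log-format simple --log-interval 100 \
--save-dir checkpoints --validate-interval-updates 5000 \
--fp16 --num-workers 0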

Related

Difference between dask node and compute node for slurm configuration

First off, apologies if I use confusing or incorrect terminology, I am still learning.
I am trying to set up configuration for a Slurm-enabled adaptive cluster.
The supercomputer and its Slurm configuration are documented here. Here is some of the most relevant information extracted from the website:
Partition Name: compute
Max Nodes per Job: 512
Max Job Runtime: 8 hours
Max resources used simultaneously: no limit
Shared Node Usage: no
Default Memory per CPU: 1920 MB
Max Memory per CPU: 8000 MB
compute
This partition consists of 2659 AMD EPYC 7763 Milan compute nodes and is intended for running parallel scientific applications. The compute nodes allocated for a job are used exclusively and cannot be shared with other jobs. Some information about the compute node:
# of CPU Cores: 64
# of Threads: 128
Here is some output from scontrol show partition:
PartitionName=compute
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=512 MaxTime=08:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=l[10000-10058,10061-10062,10064-10065,10067-10068,10070-10083,10090-10095,10100-10158,10160-10183,10190-10195,10200-10258,10260-10283,10290-10295,10300-10357,10359-10383,10390-10395,10400-10483,10490-10495,10500-10583,10590-10595,10600-10683,10690-10695,10700-10783,10790-10795,20000-20059,20061-20062,20064-20065,20067-20068,20070-20083,20090-20095,20100-20183,20190-20195,20200-20223,20225-20283,20290-20295,20300-20383,20390-20395,20400-20483,20490-20495,20500-20583,20590-20595,20600-20683,20690-20695,30000-30059,30061-30062,30064-30083,30090-30095,30100-30183,30190-30195,30200-30230,30232-30283,30290-30295,30300-30383,30390-30395,30400-30483,30490-30495,30500-30583,30590-30595,30600-30683,30690-30695,30700-30760,30762-30783,30790-30795,40000-40026,40028-40029,40031-40032,40034-40035,40037-40038,40040-40083,40090-40095,40101-40102,40104-40105,40107-40108,40110-40111,40113-40183,40190-40195,40200-40283,40287-40295,40300-40359,40400-40483,40490-40495,40500-40583,40587-40595,40600-40683,40687-40695,50200-50259,50269-50271,50300-50359,50369-50371]
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=EXCLUSIVE
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=711168 TotalNodes=2778 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerCPU=960 MaxMemPerCPU=3840
Here is what I have so far:
cluster = SLURMCluster(
    name='dask-cluster',
    processes=32,
    cores=64,
    memory=f"{8000 * 64 * 0.90} MB",
    project="ab0995",
    queue="compute",
    interface='ib0',
    walltime='08:00:00',
    asynchronous=0,
    # job_extra=["--ntasks-per-node=50",],
)
Some things to mention:
In the first table above, "nodes" refers to compute server nodes, not Dask nodes (which I think should rather be called Dask workers? If someone could clear up that terminology for me I would be grateful). Since I have 64 CPU cores and 8000 MB of allowed memory per CPU, I thought it would be sensible to set the memory to 8000 * 64 with a "reduction" factor of 0.90, just to be on the safe side.
I have 64 CPUs, which I believe should translate to 64 "cores" in the SLURMCluster. I want each Python process to have 2 CPUs, so 32 processes in total. That might be optimised down to 4 CPUs per Python process, but I have no idea how to get a feel for sensible settings here.
I set the walltime of each dask-cluster job to the maximum allowed, as I would rather block with one Slurm job than have to queue again. This might leave the server idle at times, but it might still be more effective than waiting in the Slurm batch queue.
If I now print the job script as configured above, I get:
print(cluster.job_script())
#!/usr/bin/env bash
#SBATCH -J dask-worker
#SBATCH -p compute
#SBATCH -A ab0995
#SBATCH -n 1
#SBATCH --cpus-per-task=64
#SBATCH --mem=430G
#SBATCH -t 08:00:00
/work/ab0995/AWIsoft/miniconda/NextGEMS/.conda/bin/python -m distributed.cli.dask_worker tcp://136.172.120.121:36449 --nthreads 2 --nprocs 32 --memory-limit 13.41GiB --name dummy-name --nanny --death-timeout 60 --interface ib0 --protocol tcp://
So, questions:
By my mental math, 8000 * 64 * 0.9 = 460.8 GB, not 430G. What is happening here?
I don't really understand where the nthreads, nprocs, and memory-limit values of the dask_worker come from.
Can someone give me a good distinction between Dask nodes, Dask workers, and compute nodes as seen by Slurm?
I am in a similar situation to yours, trying to understand how dask-distributed works with Slurm.
The dask-distributed docs report that Slurm uses KB or GB, but it actually means KiB or GiB, so Dask converts your value to GiB.
What I've found is that nprocs = processes, nthreads = cores / processes, and memory-limit = allocated-memory / processes. The job will then launch a dask_worker with nprocs processes of nthreads each (these are the workers of your SLURMCluster).
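As a rough sanity check against the job script in your question (a sketch; the numbers are taken straight from your SLURMCluster call and the generated script):

requested_mb = 8000 * 64 * 0.90              # 460800 MB, the value passed to SLURMCluster
requested_gib = requested_mb * 1e6 / 2**30   # ~429.2 GiB, consistent with the --mem=430G line
per_worker_gib = requested_gib / 32          # ~13.41 GiB, the --memory-limit per worker process
nthreads = 64 // 32                          # cores / processes = 2, matching --nthreads 2
print(round(requested_gib, 1), round(per_worker_gib, 2), nthreads)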
This is not clear to me either, so I don't have a good answer. I think that since Slurm nodes have several CPUs, the scheduler of your SLURMCluster manages Dask workers based on the allocated CPUs. (I didn't find anything about "Dask nodes" in the docs, though.)

PySpark PandasUDF on GCP - Memory Allocation

I am using a pandas udf to train many ML models on GCP in Dataproc (Spark). The main idea is that I have a grouping variable that represents the various sets of data in my data frame and I run something like this:
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def test_train(grp_df):
    # train model on grp_df
    # evaluate model
    # return metrics
    return metrics

result = df.groupBy('group_id').apply(test_train)
This works fine except when I use the non-sampled data, where errors are returned that appear to be related to memory issues. The messages are cryptic (to me), but if I sample the data down it runs, and if I don't, it fails. The error messages look like:
OSError: Read out of bounds (offset = 631044336, size = 69873416) in
file of size 573373864
or
Container killed by YARN for exceeding memory limits. 24.5 GB of 24
GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead or disabling
yarn.nodemanager.vmem-check-enabled because of YARN-4714.
My question is: how do I set up memory in the cluster to get this to work?
I understand that each group of data, and the process being run on it, needs to fit entirely in the memory of the executor. I currently have a 4-worker cluster with the following:
If the data in the largest group_id requires at most 150GB of memory, it seems I really need each machine to operate on one group_id at a time. At least then I get 4 times the speed compared to having a single worker or VM.
If I do the following, am I in fact creating one executor per machine that has access to all the cores minus 1 and 180GB of memory? So that if, in theory, the largest group of data fits on a single VM with that much RAM, this process should work?
spark = SparkSession.builder \
    .appName('test') \
    .config('spark.executor.memory', '180g') \
    .config('spark.executor.cores', '63') \
    .config('spark.executor.instances', '1') \
    .getOrCreate()
Let's break the answer into 3 parts:
Number of executors
The GroupBy operation
Your executor memory
Number of executors
Straight from the Spark docs:
spark.executor.instances
Initial number of executors to run if dynamic allocation is enabled.
If `--num-executors` (or `spark.executor.instances`) is set and larger
than this value, it will be used as the initial number of executors.
So, no: you only get a single executor, which won't scale up unless dynamic allocation is enabled.
You can increase the number of executors manually by configuring spark.executor.instances, or set up automatic scale-up based on workload by enabling dynamic executor allocation.
To enable dynamic allocation, you have to also enable the shuffle service which allows you to safely remove executors. This can be done by setting two configs:
spark.shuffle.service.enabled to true. Default is false.
spark.dynamicAllocation.enabled to true. Default is false.
GroupBy
I have observed groupBy being implemented with hash partitioning in Spark, which means that given x partitions and more than x unique group_by values, multiple group values will lie in the same partition.
For example, say two unique values in the group_by column, a1 and a2, have total row sizes of 100GiB and 150GiB respectively.
If they fall into separate partitions, your application will run fine, since each partition fits into the executor memory (180GiB) required for in-memory processing, and whatever does not fit into the remaining memory is spilled to disk. However, if they fall into the same partition, that partition will not fit into the executor memory (180GiB < 250GiB) and you will get an OOM.
In such cases it is useful to configure spark.default.parallelism to distribute your data over a reasonably larger number of partitions, or to apply salting or other techniques to remove the data skew.
If your data is not too skewed, you are correct that as long as your executor can handle the largest groupBy value it should work, since your data will be evenly partitioned and the chances of the above happening will be rare.
Another point to note: since you are using groupBy, which requires a data shuffle, you should also turn on the shuffle service. Without it, each executor has to serve shuffle requests alongside its own work.
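As a sketch of what raising the partition count could look like for the job in the question (1000 is just a placeholder to tune; spark.sql.shuffle.partitions is the setting that applies to DataFrame groupBy shuffles, spark.default.parallelism to RDD operations):

from pyspark.sql import SparkSession

# Spread the shuffle over more partitions so that several large groups are less
# likely to land in the same partition. Tune the value to your data.
spark = SparkSession.builder \
    .appName('test') \
    .config('spark.executor.memory', '180g') \
    .config('spark.executor.cores', '63') \
    .config('spark.sql.shuffle.partitions', '1000') \
    .config('spark.default.parallelism', '1000') \
    .getOrCreate()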
Executor memory
The total executor memory (the actual executor container size) in Spark is determined by adding the executor memory allotted for the container and the allotted memoryOverhead. The memoryOverhead accounts for things like VM overheads, interned strings, other native overheads, etc. So,
Total executor memory = (spark.executor.memory + spark.executor.memoryOverhead)
spark.executor.memoryOverhead = max(executorMemory*0.10, 384 MiB)
Based on this, you can configure your executors to have an appropriate size as per your data.
So, when you set spark.executor.memory to 180GiB, the actual executor container launched should be around 198GiB.
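Spelling that arithmetic out for the 180GiB setting from the question (a quick check, nothing more):

executor_memory_gib = 180
overhead_gib = max(executor_memory_gib * 0.10, 384 / 1024)  # max(18 GiB, 0.375 GiB) = 18 GiB
total_container_gib = executor_memory_gib + overhead_gib    # 180 + 18 = 198 GiB
print(total_container_gib)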
To resolve the YARN overhead issue you can increase the YARN overhead memory by adding .config('spark.yarn.executor.memoryOverhead', '30g'). For maximum parallelism it is recommended to keep the number of cores per executor at 5 and instead increase the number of executors.
spark = SparkSession.builder \
    .appName('test') \
    .config('spark.executor.memory', '18g') \
    .config('spark.executor.cores', '5') \
    .config('spark.executor.instances', '12') \
    .getOrCreate()

# or use dynamic resource allocation, see the config below
spark = SparkSession.builder \
    .appName('test') \
    .config('spark.shuffle.service.enabled', 'true') \
    .config('spark.dynamicAllocation.enabled', 'true') \
    .getOrCreate()
I solved the OSError: Read out of bounds error by increasing the number of groups, so that each group is smaller:
result=df.groupBy('group_id').apply(test_train)

Spark - No disk space left on Topic modelling

I am running a Jupyter notebook on a system with 64GB RAM, 32 cores and 500GB disk space.
Around 700k documents are to be modeled into 600 topics. The vocabulary size is 48000 words. 100 iterations were used.
spark = SparkSession.builder \
    .appName('LDA') \
    .master("local[*]") \
    .config("spark.local.dir", "/data/Data/allYears/tempAll") \
    .config("spark.driver.memory", "50g") \
    .config("spark.executor.memory", "50g") \
    .getOrCreate()

dataset = spark.read.format("libsvm").load("libsm_file.txt")
lda = LDA(k=600, maxIter=100, optimizer='em', seed=2)
lda.setDocConcentration([1.01])
lda.setTopicConcentration(1.001)
model = lda.fit(dataset)
A "Disk quota exceeded" error appears after 10 hours of running.
You mentioned that the error message you encountered indicated that the disk quota has been exceeded. I suspect that Spark is shuffling data to disk and that disk is out of space.
To mitigate this, you should try explicitly passing --conf spark.local.dir=<path to disk with space> pointing at a location with sufficient space. This parameter specifies the path Spark uses to write temporary data to disk (e.g. shuffle data written between stages of your job). Even if your input and output data are not particularly large, certain algorithms generate a very large amount of shuffle data.
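For example, in the builder form already used in the question (the path below is a placeholder for whichever disk actually has free space, not a recommendation):

from pyspark.sql import SparkSession

# '/path/with/plenty/of/space' is a placeholder: point it at a mount with enough
# free capacity for the temporary shuffle data written between stages.
spark = SparkSession.builder \
    .appName('LDA') \
    .master("local[*]") \
    .config("spark.local.dir", "/path/with/plenty/of/space") \
    .config("spark.driver.memory", "50g") \
    .config("spark.executor.memory", "50g") \
    .getOrCreate()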
You could also monitor the used and free space of this path (e.g. with du or df) while the job runs to see how much intermediate data is being written. That would confirm whether a large amount of shuffle data exhausting the available disk space is indeed the issue.

Any tips for scaling Spark horizontally

Does anybody have any tips when moving Spark execution from a few large nodes to many, smaller nodes?
I am running a system with 4 executors, each with 24GB of RAM and 12 cores. If I try to scale that out to 12 executors with 4 cores and 8GB of RAM each (same total RAM, same total cores, just distributed differently), I run into out-of-memory errors:
Container killed by YARN for exceeding memory limits. 8.8 GB of 8.8 GB physical memory used.
I have increased the number of partitions by a factor of 3 to create more (yet smaller) partitions, but this didn't help.
Does anybody have any tips & tricks when trying to scale spark horizontally?
This is a pretty broad question. Executor sizing in Spark is a very complicated kind of black magic, and the rules of thumb that were correct in 2015, for example, are obsolete now, just as whatever I say will be obsolete in 6 months with the next release of Spark. A lot comes down to exactly what you are doing and to avoiding key skew in your data.
This is a good place to start to learn and develop your own understanding:
https://spark.apache.org/docs/latest/tuning.html
There are also a multitude of presentations on SlideShare about tuning Spark; try to read or watch the most recent ones. Be sceptical of anything older than 18 months, and simply ignore anything older than 2 years.
I will make the assumption that you are using at least Spark 2.x.
The error you're encountering is indeed because of poor executor sizing. What is happening is that your executors are attempting to do too much at once, and running themselves into the ground as they run out of memory.
All other things being equal these are the current rules of thumb as I apply them:
The short version
3 - 4 virtual (hyperthreaded) cores and 29GB of RAM is a reasonable default executor size (I will explain why later). If you know nothing else, partition your data well and use that.
You should normally aim for a data partition size (in memory) on the order of ~100MB to ~3GB
The formulae I apply
Executor memory = number of executor cores * partition size * 1.3 (safety factor)
Partition size = size on disk of data / number of partitions * deser ratio
The deserialisation ratio is the ratio between the size of the data on disk and the size of data in memory. The Java memory representation of the same data tends to be a decent bit larger than on disk.
You also need to account for whether your data is compressed, many common formats like Parquet and ORC use compression like gzip or snappy.
For snappy compressed text data (very easily compressed), I use ~10X - 100X.
For snappy compressed data with a mix of text, floats, dates etc I see between 3X and 15X typically.
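To make the formulae concrete, here is a worked example with made-up numbers (every value below is an assumption for illustration, not a measurement):

data_on_disk_gb = 100      # assumed: snappy-compressed Parquet input
deser_ratio = 5            # assumed: within the 3X-15X range quoted above for mixed data
num_partitions = 1000      # assumed partition count
executor_cores = 4

partition_size_gb = data_on_disk_gb / num_partitions * deser_ratio   # 0.5 GB, inside the ~100MB-3GB target
executor_memory_gb = executor_cores * partition_size_gb * 1.3        # 2.6 GB, before YARN overhead
print(partition_size_gb, executor_memory_gb)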
number of executor cores = 3 to 4
Executor cores totally depends on how compute vs memory intensive your calculation is. Experiment and see what is best for your use case. I have never seen anyone informed on Spark advocate more than 6 cores.
Spark is smart enough to take advantage of data locality, so the larger your executor, the better chance that your data is PROCESS_LOCAL
More data locality is good, up to a point.
When a JVM gets too large (> 50GB), it begins to operate outside what it was originally designed for, and depending on your garbage collection algorithm, you may begin to see degraded performance and long GC pauses.
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
There also happens to be a performance trick in Java that if your JVM is smaller than 32GB, you can use 32 bit compressed pointers rather than 64 bit pointers, which saves space and reduces cache pressure.
https://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
It also so happens that YARN adds 7% or 384MB of RAM (whichever is larger) to your executor size as an overhead / safety factor, which is where the 29GB rule of thumb comes from: 29GB + 7% ≈ 31GB, comfortably under the 32GB limit.
You mentioned that you are using 12-core, 24GB RAM executors. This sends up a red flag for me.
Why?
Because every "core" in an executor is assigned one "task" at a time. A task is equivalent to the work required to compute the transformation of one partition from "stage" A to "stage" B.
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-taskscheduler-tasks.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-DAGScheduler-Stage.html
If your executor has 12 cores, then it is going to try to do 12 tasks simultaneously with a 24GB memory budget. 24GB / 12 cores = 2GB per core. If any of your partitions are greater than 2GB, you will get an out of memory error. If a particular transformation doubles the size of the input (even intermediately), then you need to account for that as well.

Spark: executor memory exceeds physical limit

My input dataset is about 150G.
I am setting
--conf spark.cores.max=100
--conf spark.executor.instances=20
--conf spark.executor.memory=8G
--conf spark.executor.cores=5
--conf spark.driver.memory=4G
but since data is not evenly distributed across executors, I kept getting
Container killed by YARN for exceeding memory limits. 9.0 GB of 9 GB physical memory used
here are my questions:
1. Did I not set up enough memory in the first place? I think 20 * 8G > 150G, but it's hard to achieve a perfect distribution, so some executors will suffer.
2. I am thinking about repartitioning the input DataFrame; how can I determine how many partitions to set? Is higher always better?
3. The error says "9 GB physical memory used", but I only set 8G as executor memory; where does the extra 1G come from?
Thank you!
When using yarn, there is another setting that figures into how big to make the yarn container request for your executors:
spark.yarn.executor.memoryOverhead
It defaults to 0.1 * your executor memory setting. It defines how much extra overhead memory to ask for in addition to what you specify as your executor memory. Try increasing this number first.
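For instance, in the same --conf style as your submit command (the 2g value is only an illustrative starting point to tune, not a recommendation):
--conf spark.executor.memory=8G
--conf spark.yarn.executor.memoryOverhead=2g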
Also, YARN won't give you a container of arbitrary size. It will only return containers with a memory size that is a multiple of its minimum allocation size, which is controlled by this setting:
yarn.scheduler.minimum-allocation-mb
Setting that to a smaller number will reduce the risk of you "overshooting" the amount you asked for.
I also typically set the key below to a value larger than my desired container size, to ensure that the Spark request controls how big my executors are instead of YARN stomping on them. This is the maximum container size YARN will give out.
yarn.nodemanager.resource.memory-mb
The 9GB is composed of the 8GB executor memory, which you pass as a parameter, plus spark.yarn.executor.memoryOverhead, which defaults to 0.1. So the total memory of the container is spark.executor.memory + (0.1 * spark.executor.memory), which is 8GB + (0.1 * 8GB) ≈ 9GB.
You could run the entire process with a single executor, but it would take ages. To understand this, you need to know the notions of partitions and tasks. The number of partitions is defined by your input and your actions. For example, if you read a 150GB csv from HDFS and your HDFS block size is 128MB, you end up with 150 * 1024 / 128 = 1200 partitions, which map directly to 1200 tasks in the Spark UI.
Every single task is picked up by an executor. You never need to hold the whole 150GB in memory. For example, with a single executor you obviously won't benefit from Spark's parallel capabilities, but it will simply start with the first task, process the data, save it back to the DFS, and move on to the next task.
What you should check:
How big are the input partitions? Is the input file splittable at all? If a single executor has to load a massive amount of data, it will run out of memory for sure.
What kind of actions are you performing? For example, a join on a key with very low cardinality produces massive partitions, because all the rows with a specific value end up in the same partition.
Are you performing very expensive or inefficient actions? Any Cartesian product, etc.?
Hope this helps. Happy sparking!
