What does "promotion failure" in a JVM GC log mean? - garbage-collection

2020-06-17T12:54:16.995+0800: 681976.777: [GC (Allocation Failure) 2020-06-17T12:54:16.995+0800: 681976.777: [ParNew (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3) (promotion failed): 1823484K->1883608K(1887488K), 0.4255307 secs]
2020-06-17T12:54:17.421+0800: 681977.202: [CMS: 952879K->532413K(2097152K), 2.9620031 secs] 2776364K->532413K(3984640K), [Metaspace: 211696K->211696K(1241088K)], 3.3881912 secs] [Times: user=3.23 sys=0.65, real=3.39 secs]
My question is: what does this log mean?
(0: promotion failure size = 2) (1: promotion failure size = 3) (2: promotion failure size = 17912169) (3: promotion failure size = 3)

You didn't say which JDK/JVM you use or which garbage collector, but based on the log format it's ParNew + CMS, which was deprecated long ago and removed in JDK 14: https://openjdk.java.net/jeps/363
You should seriously consider switching to a more modern JDK / GC.
That said, the log means that the GC failed to "promote" objects from the Young generation to the Old generation due to insufficient space (memory) in the Old gen or its fragmentation. The per-thread entries come from -XX:+PrintPromotionFailure: each "(n: promotion failure size = N)" pair is the GC worker thread number and the size (in heap words) of the object that thread could not promote, so worker 2 repeatedly failed on a very large object (~17.9 million words) that no longer fit in a contiguous free chunk of the Old generation.
Here are some links where they discuss this issue:
Java GC Promotion Failures
Avoiding promotion failed in Java CMS GC
https://serverfault.com/questions/698858/gc-taking-big-pause-and-parnew-promotion-failed
https://blogs.oracle.com/poonam/troubleshooting-long-gc-pauses
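If you do have to stay on CMS for now, the usual mitigations are giving the Old generation more headroom and starting concurrent cycles earlier (a lower CMSInitiatingOccupancyFraction), so a large promotion is less likely to hit a full or badly fragmented Old gen; moving to G1 (the default collector since JDK 9) sidesteps this because it compacts regions as it evacuates them. A rough sketch using standard HotSpot options (the heap sizes and app.jar are placeholders, not tuning advice for this workload):

# Stay on CMS: more headroom, earlier concurrent cycles, keep the per-thread detail in the log
java -Xms4g -Xmx4g \
     -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:+PrintPromotionFailure \
     -jar app.jar

# Or switch to G1, which compacts old regions and is much less prone to this
# fragmentation-induced failure mode
java -Xms4g -Xmx4g -XX:+UseG1GC -jar app.jar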

Related

NetworkX find_cliques error using PySpark

I'm trying to use NetworkX's find_cliques to locate the maximal cliques for each subgroup.
I'm using the implementation below as a pandas_udf, grouped by connected component.
import pandas as pd
import networkx as nx
from networkx import find_cliques

def pd_create_subgroups(pdf):
    index = pdf.component.unique()[0]
    try:
        # building the graph
        gnx = nx.from_pandas_edgelist(pdf, "src", "dst")
        bic = list(find_cliques(gnx))
        if len(bic) <= 1:
            return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})
        bic_sorted = sorted(map(sorted, bic))
        bic_sorted = [b for b in bic_sorted if len(b) >= 3]
        if len(bic_sorted) == 0:
            return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})
        return pd.DataFrame([bic_sorted]).transpose().rename(columns={0: "cliques"})
    except:
        return pd.DataFrame(data={"cliques": [[f"issue_{index}"]]})
pdf is a pandas DataFrame containing the fields src, dst, component.
It has around 200M-300M undirected edges, and the job fails with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 12.0 failed 4 times, most recent failure: Lost task 0.3 in stage 12.0 (TID 331) (executor 9): java.lang.IndexOutOfBoundsException: index: 2147483628, length: 36 (expected: range(0, 2147483648))
When running on smaller graphs it works properly.
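For context, a grouped pandas UDF like this is normally wired up through applyInPandas; here is a minimal sketch of the presumed invocation, since the question doesn't show it (the edges_df name and the output schema are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Output schema matching the single "cliques" column returned by pd_create_subgroups
schema = StructType([StructField("cliques", ArrayType(StringType()))])

# edges_df is assumed to be the DataFrame holding the src, dst, component columns
cliques_df = edges_df.groupBy("component").applyInPandas(pd_create_subgroups, schema=schema)

The 2147483648 in the error message is 2^31, which suggests a single component's Arrow buffer is growing past the 2 GiB per-buffer limit rather than a problem in find_cliques itself.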

PyTorch Pre-Allocation to avoid OOM does not work

So, I am trying to finetune FCoref using the trainer in https://github.com/shon-otmazgin/fastcoref
This uses dynamic batching with variable-length inputs, and that creates an issue on CUDA because once PyTorch has allocated memory for the first batch, it does not grow it for later, larger batches.
So, following this guide here: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html#pre-allocate-memory-in-case-of-variable-input-length
I added this to my code and I call it before running the actual training (right after creating the model and moving it to CUDA):
batch = {
    "input_ids": torch.rand(9, 5, 512),
    "attention_mask": torch.rand(9, 5, 512),
    "gold_clusters": torch.rand(9, 58, 39, 2),
    "leftovers": {
        "input_ids": torch.rand(4),
        "attention_mask": torch.rand(4),
    }
}
batch['input_ids'] = torch.tensor(batch['input_ids'], device=self.device)
batch['attention_mask'] = torch.tensor(batch['attention_mask'], device=self.device)
batch['gold_clusters'] = torch.tensor(batch['gold_clusters'], device=self.device)
if 'leftovers' in batch:
    batch['leftovers']['input_ids'] = torch.tensor(
        batch['leftovers']['input_ids'], device=self.device)
    batch['leftovers']['attention_mask'] = torch.tensor(
        batch['leftovers']['attention_mask'], device=self.device)

self.model.zero_grad()
self.model.train()
with torch.cuda.amp.autocast():
    outputs = self.model(batch, gold_clusters=batch['gold_clusters'],
                         return_all_outputs=False)
    loss = outputs[0]  # model outputs are always tuple in transformers (see doc)
loss.backward()
At first, I was getting OOM issues with this because it was too big (I basically created the biggest tensors in each key according to my dataset).
So, instead, I created a batch that looks like my biggest batch in the actual data (according to the sum of tensor sizes):
batch = {
    "input_ids": torch.rand(4, 1, 512),
    "attention_mask": torch.rand(4, 1, 512),
    "gold_clusters": torch.rand(4, 11, 24, 2),
    "leftovers": {
        "input_ids": torch.rand(4, 459),
        "attention_mask": torch.rand(4, 459),
    }
}
Now, this works but when the actual training starts, I run into the same issue even though the first batch is smaller than the pre-allocation batch:
OutOfMemoryError: CUDA out of memory. Tried to allocate 74.00 MiB (GPU 0; 14.56 GiB total capacity; 13.31 GiB already allocated; 36.44 MiB free; 13.76 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Other things I tried:
export 'PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:21'
Decreasing batch size, but due to the variability I keep running into the same issue.
My machine:
runs Debian 4.19.260-1 (2022-09-29) x86_64 GNU/Linux
T4 GPU with 16 GB VRAM
Any idea?
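For reference, the pre-allocation recipe in the linked tuning guide boils down to one warm-up forward/backward pass at the largest expected input size, then releasing the warm-up graph while keeping the allocator's reserved blocks. A minimal, self-contained sketch (the toy model and sizes are made up, not FCoref's):

import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2)).to(device)

# Warm-up pass at the largest expected batch/sequence size so the caching
# allocator reserves blocks big enough for every later (smaller) batch.
warmup = torch.rand(8, 512, device=device)
model.train()
with torch.cuda.amp.autocast():
    out = model(warmup)
    loss = out.sum()
loss.backward()

# Drop the warm-up graph and gradients; the reserved blocks stay cached.
# Do NOT call torch.cuda.empty_cache() here - that would hand the blocks back.
model.zero_grad(set_to_none=True)
del out, loss, warmup

In the error above, 13.31 GiB is already allocated (not just reserved), so it may also be worth printing torch.cuda.memory_summary() right after the warm-up to see what is still being held before the real batches arrive.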

RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device

My code is here:
with cp.cuda.Device(cp_device):
    candidates = cp.zeros((point_idxs.shape[0], num_rots, 3), cp.float32)
    block_size = (point_idxs.shape[0] + 512 - 1) // 512
    rot_voting_kernel(
        (block_size, 1, 1),
        (512, 1, 1),
        (
            cp.asarray(pc),
            cp.asarray(preds_tr[0].cpu().numpy()),
            cp.asarray(direction[0].cpu().numpy()),
            candidates,
            cp.asarray(point_idxs).astype(cp.int32),
            cp.asarray(corners[0]).astype(cp.float32),
            cp.float32(cfg.res),
            point_idxs.shape[0], num_rots, grid_obj.shape[0], grid_obj.shape[1], grid_obj.shape[2]
        )
    )
sph_cp = torch.tensor(sphere_pts.T, dtype=torch.float32).cuda()
start = np.arange(0, point_idxs.shape[0] * num_rots, num_rots)
np.random.shuffle(start)
sub_sample_idx = (start[:10000, None] + np.arange(num_rots)[None]).reshape(-1)
candidates = torch.as_tensor(candidates, device='cuda').reshape(-1, 3)
candidates = candidates[torch.LongTensor(sub_sample_idx).cuda()]
cos = candidates.mm(sph_cp)
counts = torch.sum(cos > np.cos(angle_tol / 180 * np.pi), 0).cpu().numpy()
best_dir = np.array(sphere_pts[np.argmax(counts)])
When I run the code in WSL, I get this traceback:
Traceback (most recent call last):
  File "nocs/inference.py", line 298, in <module>
    candidates = torch.as_tensor(candidates, device='cuda').reshape(-1, 3)
RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.
How can I solve this? I tried Google but didn't get anywhere.
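Not specific to the WSL failure, but for reference, the usual zero-copy way to hand a CuPy array to PyTorch is DLPack rather than torch.as_tensor. A minimal sketch (the shapes are made up, and it assumes a working CUDA setup):

import cupy as cp
import torch
from torch.utils.dlpack import from_dlpack

with cp.cuda.Device(0):
    candidates = cp.zeros((1024, 36, 3), dtype=cp.float32)

# Zero-copy view of the CuPy device memory as a CUDA torch tensor via DLPack
candidates_t = from_dlpack(candidates.toDlpack()).reshape(-1, 3)

# On recent PyTorch/CuPy versions the shorter form works too:
# candidates_t = torch.from_dlpack(candidates)

print(candidates_t.shape, candidates_t.device)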

RuntimeError: size mismatch, m1: [192 x 68], m2: [1024 x 68] at /opt/conda/conda-bld/pytorch_/work/aten/src/THC/generic/THCTensorMathBlas.cu:268

I'm getting a size mismatch error that I can't understand.
(Pdb) self.W_di
Linear(in_features=68, out_features=1024, bias=True)
(Pdb) indices.size()
torch.Size([32, 6, 68])
(Pdb) self.W_di(indices)
*** RuntimeError: size mismatch, m1: [192 x 68], m2: [1024 x 68] at /opt/conda/conda-bld/pytorch_1556653099582/work/aten/src/THC/generic/THCTensorMathBlas.cu:268
Why is there a mismatch?
Maybe because of the way I defined the weight in forward (instead of __init__)?
This is how I defined self.W_di:
def forward(self):
    if self.W_di is None:
        self.W_di_weight = nn.Parameter(torch.randn(mL_n * 2, 1024).to(device))
        self.W_di_bias = nn.Parameter(torch.ones(1024).to(device))
        self.W_di = nn.Linear(mL_n * 2, 1024)
        self.W_di.weight = self.W_di_weight
        self.W_di.bias = self.W_di_bias
    result = self.W_di(indices)
Any pointer would be highly appreciated!
Check my answer here. In general you may set
self.W_di = nn.Linear(mL_n * 2, 68)
or increase the in_features.
We also commonly hit this error with CNNs, when the input image is not resized to the input size the model expects.
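For reference, nn.Linear stores its weight as (out_features, in_features) and is applied to the last dimension of its input, which matters if you assign the weight by hand as in the forward above. A minimal sketch of the convention, with the layer defined once in __init__ (mL_n = 34 is an assumption, chosen so that mL_n * 2 == 68):

import torch
import torch.nn as nn

mL_n = 34  # assumed, so that in_features = mL_n * 2 = 68

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # Defined once in __init__; weight shape is (out_features, in_features) = (1024, 68)
        self.W_di = nn.Linear(mL_n * 2, 1024)

    def forward(self, indices):
        # Applied to the last dimension, so a [32, 6, 68] input gives [32, 6, 1024]
        return self.W_di(indices)

model = Model()
print(model.W_di.weight.shape)              # torch.Size([1024, 68])
print(model(torch.randn(32, 6, 68)).shape)  # torch.Size([32, 6, 1024])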

Weird error when selecting more than 100 spark udf columns

Starting with a simple spark dataframe with only one value, I create N simple udf columns.
import pyspark.sql.functions

N = 100
df = sqlContext.createDataFrame([{'value': 0}])
udf_columns = [pyspark.sql.functions.udf(lambda x: 0)('value') for _ in range(N)]
df.select(udf_columns).take(1)
For N <= 100 this code works perfectly, but as soon as N >= 101 I get the following error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 34.0 failed 1 times, most recent failure: Lost task 0.0 in stage 34.0 (TID 50, localhost): java.lang.UnsupportedOperationException: Cannot evaluate expression: PythonUDF#<lambda>(input[0, LongType])
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.genCode(Expression.scala:239)
at org.apache.spark.sql.execution.PythonUDF.genCode(python.scala:44)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$gen$2.apply(Expression.scala:104)
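If upgrading Spark isn't an option, one blunt workaround sometimes used is to add the UDF columns in chunks of at most 100 and materialize the intermediate result between chunks, so no single plan carries more than 100 Python UDFs. This is only a sketch (the scratch path and column names are arbitrary, and whether it helps depends on where the 100-column limit actually bites):

# Add the UDF columns in chunks, writing each intermediate result out so the
# next select starts from a fresh plan with at most 100 Python UDFs in it.
chunk_size = 100
remaining = udf_columns
current = df
step = 0
while remaining:
    chunk, remaining = remaining[:chunk_size], remaining[chunk_size:]
    current = current.select('*', *[c.alias('udf_%d_%d' % (step, i))
                                    for i, c in enumerate(chunk)])
    path = '/tmp/udf_step_%d.parquet' % step  # arbitrary scratch location
    current.write.mode('overwrite').parquet(path)
    current = sqlContext.read.parquet(path)
    step += 1
current.take(1)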

Resources