Training model with CreateML MLTextClassifier, stopped by EXC_BAD_ACCESS (code=1, address=0x0) - nlp

I'm trying to train my own NLP model with CreateML in an Xcode playground, following Apple's tutorial: https://developer.apple.com/documentation/createml/creating_a_text_classifier_model
but the program is terminated by EXC_BAD_ACCESS (code=1, address=0x0).
The explanations I found on the Internet state that this error means a NULL pointer is being dereferenced when a variable is accessed.
import Foundation
import CreateML
let source = "icecream"
let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "/path/to/\(source).csv"))
let (trainingData, testingData) = data.randomSplit(by: 0.8, seed: 0)
// program stopped here
let sentimentClassifier = try MLTextClassifier(trainingData: trainingData, textColumn: "text", labelColumn: "sentiment")
// error
error: Execution was interrupted, reason: EXC_BAD_ACCESS (code=1, address=0x0).
// output
Finished parsing file /path/to/icecream.csv
Parsing completed. Parsed 100 lines in 0.03412 secs.
Finished parsing file /path/to/icecream.csv
Parsing completed. Parsed 188 lines in 0.008235 secs.
Automatically generating validation set from 10% of the data.
Tokenizing data and extracting features
Starting MaxEnt training with 146 samples
Iteration 1 training accuracy 0.650685
Iteration 2 training accuracy 0.869863
Iteration 3 training accuracy 0.945205
Iteration 4 training accuracy 0.986301
Iteration 5 training accuracy 0.993151
Finished MaxEnt training in 0.04 seconds

Related

Clarifications on training job parameters with Tensorflow

I'm using the new TensorFlow Object Detection API.
I need to replicate the training parameters used in a paper, but I'm a bit confused.
The paper states:
When training neural network models, their base configuration is similar to that used to
train on the COCO 2017 dataset. For the unambiguous comparison of the selected models, the total number of
training steps was set to 100 equal to 100′000 iterations of learning.
Inside model_main_tf2.py, which is the script used to start the training, I can read the following:
"""Creates and runs TF2 object detection models.
For local training/evaluation run:
PIPELINE_CONFIG_PATH=path/to/pipeline.config
MODEL_DIR=/tmp/model_outputs
NUM_TRAIN_STEPS=10000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python model_main_tf2.py -- \
--model_dir=$MODEL_DIR --num_train_steps=$NUM_TRAIN_STEPS \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--pipeline_config_path=$PIPELINE_CONFIG_PATH \
--alsologtostderr
"""
Also, you can specify the num_steps and total_steps parameters in the pipeline.config file (used by the training script):
train_config: {
  batch_size: 1
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 50000
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .16
          total_steps: 50000
          warmup_learning_rate: 0
          warmup_steps: 2500
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
So what I'm not understanding is how to map what is written in the paper to the TensorFlow parameters.
What are num_steps and total_steps inside the pipeline.config file?
And what is the NUM_TRAIN_STEPS argument?
Does it overwrite the steps from the config file, or is it a completely different thing?
If more details are needed feel free to ask.
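For reference, here is a minimal sketch (mine, not from the question) of inspecting those step-related fields with the Object Detection API's config utilities; the path is a placeholder and the object_detection package is assumed to be installed:
from object_detection.utils import config_util

# Load pipeline.config into Python proto objects (placeholder path).
configs = config_util.get_configs_from_pipeline_file("path/to/pipeline.config")
train_config = configs["train_config"]

# Number of training steps the training loop runs by default.
print(train_config.num_steps)

# total_steps of the cosine-decay learning-rate schedule (usually set equal to num_steps
# so the learning rate decays over the whole run).
print(train_config.optimizer.momentum_optimizer
      .learning_rate.cosine_decay_learning_rate.total_steps)
As far as I can tell from model_main_tf2.py, the --num_train_steps flag, when supplied, is passed down to the training loop and takes precedence over num_steps in pipeline.config; when it is omitted, num_steps from the config is used.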

Natural Nodejs NLP classifier gives heap dump when training more than 2000 data points

We are using the natural Node.js library to classify our users' queries in the following way:
const natural = require("natural");
const pino = require("pino")(); // logger used below
const cls = new natural.LogisticRegressionClassifier();
cls.addDocument("book", "stationary");
cls.addDocument("pen", "stationary");
// ... 5000+ more data points ...
cls.addDocument("last one", "last one");
cls.train(); // this call throws a heap error and the program crashes
pino.info("StockNames training successfully completed.");
The train() function works fine when the dataset size is under a few hundred, but it throws heap errors and crashes when the dataset size is in the thousands. Any suggestions would help. Thanks.

Torch.cuda.empty_cache() very very slow performance

I have a severe performance problem when I execute an inference batch loop on a single GPU.
This slow behavior appears after the first batch has been processed,
that is, when the GPU is already almost full and its memory needs to be recycled to accept the next batch.
From a pristine GPU state, the performance is super fast (as expected).
I hope both the following code snippet and the output illustrate the problem in a nutshell.
(I've removed the print and time measurements from the snippet for brevity.)
predictions = None
for i, batch in enumerate(self.test_dataloader):
    # if this line is active - the bottleneck after the first batch moves here, rather than below
    # i.e. when i > 0
    # torch.cuda.empty_cache()

    # HUGE PERFORMANCE HIT HAPPENS HERE - after the first batch
    # i.e. when i > 0
    # obviously tensor.to(device) uses torch.cuda.empty_cache() internally when needed
    # and it is inexplicably SLOW
    batch = tuple(t.to(device) for t in batch)  # to GPU (or CPU) when gpu
    b_input_ids, b_input_mask, b_labels = batch

    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    logits = outputs[0]
    logits = logits.detach()

    # that doesn't help alleviate the issue
    del outputs

    predictions = logits if predictions is None else torch.cat((predictions, logits), 0)

    # nor do all of the below - freeing references doesn't help speeding up
    del logits
    del b_input_ids
    del b_input_mask
    del b_labels
    for o in batch:
        del o
    del batch

Output:
start empty cache... 0.00082
end empty cache... 1.9e-05
start to device... 3e-06
end to device... 0.001179 - HERE - time is super fast (as expected)
start outputs... 8e-06
end outputs... 0.334536
logits... 6e-06
start detach... 1.7e-05
end detach... 0.004036
start empty cache... 0.335932
end empty cache... 4e-06
start to device... 3e-06
end to device... 16.553849 - HERE - time is ridiculously high - it's 16 seconds to move tensor to GPU
start outputs... 2.3e-05
end outputs... 0.020878
logits... 7e-06
start detach... 1.4e-05
end detach... 0.00036
start empty cache... 0.00082
end empty cache... 6e-06
start to device... 4e-06
end to device... 17.385204 - HERE - time is ridiculously high
start outputs... 2.9e-05
end outputs... 0.021351
logits... 4e-06
start detach... 1.3e-05
end detach... 1.1e-05
...
Have I missed something obvious or is this the expected GPU behavior?
I am posting this question before engaging in complex coding that juggles between the couple of GPUs and the CPU available on my server.
Thanks in advance,
Albert
EDIT
RESOLVED: the issue was in the DataLoader constructor, where I changed pin_memory to False (True was causing the issue). That cut the .to(device) time by a factor of roughly 4-5.
self.test_dataloader = DataLoader(
    test_dataset,
    sampler=SequentialSampler(test_dataset),
    # batch_size=len(test_dataset)  # AKA a single batch - nope! no memory for that
    batch_size=BATCH_SIZE_AKA_MAX_ROWS_PER_GUESS_TO_FIT_GPU_MEM,
    # tests
    num_workers=8,
    # maybe this is the culprit as suggested by user12750353 on Stack Overflow
    # pin_memory=True
    pin_memory=False,
)
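Side note (a hypothetical variant, not from the post): pin_memory=True mainly pays off when the host-to-device copies are made non-blocking, so they can overlap with other GPU work:
# Assumes the same `batch` and `device` as in the loop above, with pin_memory=True.
batch = tuple(t.to(device, non_blocking=True) for t in batch)  # asynchronous H2D copies
# ... enqueue further GPU work here; it can overlap with the copies ...
torch.cuda.synchronize()  # block until the copies (and any queued work) have finished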
You should not need to clear the cache if you are properly clearing the references to the previously allocated variables. The cache is effectively free memory that your script can use for new variables.
Also notice that
a = torch.zeros(10**9, dtype=torch.float)
a = torch.zeros(10**9, dtype=torch.float)
requires 8 GB of memory, even though a uses only 4 GB (1 billion elements of 4 bytes each). This happens because torch.zeros allocates the new memory before the previous contents of a are released. The same may be happening in your model at a larger scale, depending on how it is implemented.
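A minimal sketch of one way to avoid that double allocation (drop the old reference before allocating the replacement):
import torch

a = torch.zeros(10**9, dtype=torch.float)
del a                                       # release the old 4 GB before reallocating
a = torch.zeros(10**9, dtype=torch.float)   # peak memory now stays around 4 GB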
Edit 1
One suspicious thing is that you are copying your batch to the GPU one example at a time.
Just to illustrate what I mean:
import torch
device = 'cuda'
batch = torch.zeros((4500, 10))
Creating the batch as a tuple (one copy per example):
batch_gpu = tuple(t.to(device) for t in batch)
torch.cuda.synchronize()
254 ms ± 36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Creating the batch as a list (one copy per example):
batch_gpu = list(t.to(device) for t in batch)
torch.cuda.synchronize()
235 ms ± 3.74 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Copying the whole batch at once:
batch_gpu = batch.to(device)
torch.cuda.synchronize()
115 µs ± 2.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In this example it was roughly 2000x faster to copy the whole batch at once than to copy it one example at a time.
Notice that the GPU works asynchronously with respect to the CPU, so you may keep calling functions that return before the operation has finished. To make meaningful measurements, call torch.cuda.synchronize() to make the time boundaries explicit.
The code to be instrumented is this:
for i, batch in enumerate(self.test_dataloader):
    # torch.cuda.empty_cache()
    # torch.cuda.synchronize()  # if empty_cache is used

    # start timer for copy
    batch = tuple(t.to(device) for t in batch)  # to GPU (or CPU) when gpu
    torch.cuda.synchronize()
    # stop timer for copy

    b_input_ids, b_input_mask, b_labels = batch

    # start timer for inference
    with torch.no_grad():
        outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    torch.cuda.synchronize()
    # stop timer for inference

    logits = outputs[0]
    logits = logits.detach()
    # if you copy outputs to the CPU it will be synchronized
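CUDA events give another way to get well-defined GPU timings; here is a short sketch of mine (it assumes the same batch and device as in the code above):
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()                    # mark the start on the current CUDA stream
batch_gpu = batch.to(device)      # the GPU work being timed (any op would do)
end.record()                      # mark the end
torch.cuda.synchronize()          # wait until both events have completed
print(f"{start.elapsed_time(end):.3f} ms")   # elapsed_time() reports milliseconds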

How to measure ONLY the inference time in the GPU, using TensorRT and PyCUDA?

I want to measure ONLY the inference time on the Jetson TX2. How can I improve my function to do that? Right now I am measuring:
the transfer of the image from CPU to GPU
transfer of results from GPU to CPU
the inference
Or is that not possible because of the way GPUs work? I mean, how many times will I have to call stream.synchronize() if I divide the function into 3 parts (see the sketch after the timing code below):
transfer from CPU to GPU
Inference
transfer from GPU to CPU
Thank you
CODE IN INFERENCE.PY
def do_inference(engine, pics_1, h_input, d_input, h_output, d_output, stream, batch_size):
    """
    This is the function to run the inference
    Args:
        engine : Path to the TensorRT engine.
        pics_1 : Input images to the model.
        h_input: Input in the host (CPU).
        d_input: Input in the device (GPU).
        h_output: Output in the host (CPU).
        d_output: Output in the device (GPU).
        stream: CUDA stream.
        batch_size : Batch size for execution time.
        height: Height of the output image.
        width: Width of the output image.
    Output:
        The list of output images.
    """
    # Context for executing inference using ICudaEngine
    with engine.create_execution_context() as context:
        # Transfer input data from CPU to GPU.
        cuda.memcpy_htod_async(d_input, h_input, stream)
        # Run inference.
        # context.profiler = trt.Profiler()  # shows execution time (ms) of each layer
        context.execute(batch_size=1, bindings=[int(d_input), int(d_output)])
        # Transfer predictions back from the GPU to the CPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Synchronize the stream.
        stream.synchronize()
        # Return the host output.
        out = h_output
        return out
CODE IN TIMER.PY
for i in range(count):
    start = time.perf_counter()
    # Classification - calling TX2_classify.py
    out = eng.do_inference(engine, image, h_input, d_input, h_output, d_output, stream, 1)
    inference_time = time.perf_counter() - start
    print("TIME")
    print(inference_time * 1000)
    print("\n")
    pred = postprocess_inception(out)
    print(pred)
    print("\n")
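One way to split the timing into the three stages is sketched below (my sketch, not the original code; it reuses the same engine, buffers, and stream as in do_inference above). With this split, stream.synchronize() ends up being called three times per call, once at each stage boundary:
import time
import pycuda.driver as cuda

def do_inference_timed(engine, h_input, d_input, h_output, d_output, stream, batch_size=1):
    """Same inference as above, but with each stage timed separately (sketch)."""
    with engine.create_execution_context() as context:
        t0 = time.perf_counter()
        cuda.memcpy_htod_async(d_input, h_input, stream)    # 1) CPU -> GPU copy
        stream.synchronize()                                # wait for the input copy only
        t1 = time.perf_counter()
        context.execute(batch_size=batch_size, bindings=[int(d_input), int(d_output)])  # 2) inference
        stream.synchronize()                                # make the stage boundary explicit
        t2 = time.perf_counter()
        cuda.memcpy_dtoh_async(h_output, d_output, stream)  # 3) GPU -> CPU copy
        stream.synchronize()                                # wait for the results
        t3 = time.perf_counter()
        print("H2D: %.3f ms, inference: %.3f ms, D2H: %.3f ms"
              % ((t1 - t0) * 1e3, (t2 - t1) * 1e3, (t3 - t2) * 1e3))
        return h_output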

JAGS Beginner - Receiving and Understanding Output

When using JAGS, how does one receive output from a model in the format:
Inference for Bugs model at "model.txt", fit using jags,
3 chains, each with 10000 iterations (first 5000 discarded)
n.sims = 15000 iterations saved
mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff
mu 9.950 0.288 9.390 9.755 9.951 10.146 10.505 1.001 11000
sd.obs 3.545 0.228 3.170 3.401 3.534 3.675 3.978 1.001 13000
deviance 820.611 3.460 818.595 819.132 819.961 821.366 825.871 1.001 15000
I assumed, as with BUGS, that it would appear when the model completes; however, I only get something in this format:
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 1785
Unobserved stochastic nodes: 1843
Total graph size: 61542
Initializing model
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
Apologies for the basic question. If anyone can point me to useful introductory JAGS material, that would also be appreciated.
Kind regards.
If you only get the plus signs, it means you have only initialized the model. When JAGS actually runs, it typically prints '*' progress marks afterwards. So you are missing a line here (it would have been nice to see your code). For instance, if you use R2jags, you would write:
out <- jags(data = data, parameters.to.save = params, n.chains = 3, n.iter = 90000, n.burnin = 5000,
            model.file = modFile)
out.upd <- update(out, n.iter = 10000)
