PyTorch Lightning doesn't start fit in local environment - pytorch

I'm studying NLP with PyTorch.
I have a problem in my local environment.
The same code works well in Google Colab, but it doesn't work on my local machine.
This is the code that doesn't work, and its output.
I couldn't find anyone reporting the same problem.
The code uses all of the DRAM but doesn't raise an OOM error.
trainer.fit(
    task,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader
)
Error Message:
Missing logger folder: d:\lightning_logs
C:\Users\Leecarry\AppData\Roaming\Python\Python310\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:608: UserWarning: Checkpoint directory D:\ exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
C:\Users\Leecarry\AppData\Roaming\Python\Python310\site-packages\pytorch_lightning\core\optimizer.py:380: RuntimeWarning: Found unsupported keys in the optimizer configuration: {'scheduler'}
rank_zero_warn(
| Name | Type | Params
--------------------------------------------------------
0 | model | BertForSequenceClassification | 108 M
--------------------------------------------------------
108 M Trainable params
0 Non-trainable params
108 M Total params
435.680 Total estimated model params size (MB)
Epoch 0: 0%| | 0/6251 [00:00<?, ?it/s]
This image shows the same code running in Google Colab.
Please tell me why fit doesn't start, and any ideas for fixing it.
I'm sorry for my poor English.
Thank you for reading.
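The RuntimeWarning in the log is probably separate from the stall, but it is worth fixing: it suggests that configure_optimizers returned a dict with a top-level 'scheduler' key, which Lightning reports as unsupported. A minimal sketch of that key check, assuming the recognized top-level keys are 'optimizer', 'lr_scheduler' and 'monitor'; unsupported_keys is a hypothetical helper written for illustration, not Lightning's own code.

```python
# Assumption: these are the top-level keys Lightning recognizes in the dict
# returned by configure_optimizers.
RECOGNIZED_KEYS = {"optimizer", "lr_scheduler", "monitor"}


def unsupported_keys(optim_conf: dict) -> set:
    """Return the top-level keys that would trigger the 'unsupported keys' warning."""
    return set(optim_conf) - RECOGNIZED_KEYS


# Placeholder objects stand in for a real optimizer and scheduler.
optimizer, scheduler = object(), object()

# Shape that matches the warning in the log: 'scheduler' sits at the top level.
bad_conf = {"optimizer": optimizer, "scheduler": scheduler}

# Shape with the scheduler nested under 'lr_scheduler'.
good_conf = {
    "optimizer": optimizer,
    "lr_scheduler": {"scheduler": scheduler, "interval": "step"},
}

print(unsupported_keys(bad_conf))
print(unsupported_keys(good_conf))
```

With the nested form, the scheduler is no longer reported as an unsupported key.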

Related

Clarifications on training job parameters with Tensorflow

I'm using the new TensorFlow Object Detection API.
I need to replicate the training parameters used in a paper, but I'm a bit confused.
The paper states:
When training neural network models, their base configuration is similar to that used to
train on the COCO 2017 dataset. For the unambiguous comparison of the selected models, the total number of
training steps was set to 100 equal to 100′000 iterations of learning.
Inside model_main_tf2.py, which is the script used to start the training, I can read the following:
"""Creates and runs TF2 object detection models.
For local training/evaluation run:
PIPELINE_CONFIG_PATH=path/to/pipeline.config
MODEL_DIR=/tmp/model_outputs
NUM_TRAIN_STEPS=10000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python model_main_tf2.py -- \
--model_dir=$MODEL_DIR --num_train_steps=$NUM_TRAIN_STEPS \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--pipeline_config_path=$PIPELINE_CONFIG_PATH \
--alsologtostderr
"""
Also, you can specify the num_steps and total_steps parameters in the pipeline.config file (used by the training script):
train_config: {
  batch_size: 1
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 50000
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .16
          total_steps: 50000
          warmup_learning_rate: 0
          warmup_steps: 2500
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
So, what I don't understand is how I should map what is written in the paper to the TensorFlow parameters.
What are num_steps and total_steps inside the pipeline.config file?
And what is the NUM_TRAIN_STEPS argument?
Does it override the steps in the config file, or is it a completely different thing?
If more details are needed, feel free to ask.
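As far as I understand, total_steps belongs to the learning-rate schedule (the step at which the cosine decay reaches zero), while num_steps is how long the training loop runs. A minimal plain-Python sketch of a cosine schedule with linear warmup, using the values from the config above as defaults; it is an illustration of the idea under those assumptions, not the Object Detection API's own implementation.

```python
import math


def cosine_decay_with_warmup(step, learning_rate_base=0.16, total_steps=50000,
                             warmup_learning_rate=0.0, warmup_steps=2500):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step > total_steps:
        # Past the end of the schedule the rate stays at zero.
        return 0.0
    if step < warmup_steps:
        # Linear ramp from warmup_learning_rate up to learning_rate_base.
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    # Fraction of the decay phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * learning_rate_base * (1.0 + math.cos(math.pi * progress))


for step in (0, 1250, 2500, 26250, 50000, 60000):
    print(step, cosine_decay_with_warmup(step))
```

In this sketch, training for more steps than total_steps would spend the extra steps at a learning rate of zero, which is why the two values are usually set to the same number.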

RuntimeError on running ALBERT for obtaining encoding vectors from text

I’m trying to get feature vectors from the encoder model using pre-trained ALBERT v2 weights. I have an NVIDIA 1650 Ti GPU (4 GB) and sufficient RAM (8 GB), but for some reason I’m getting a RuntimeError saying:
RuntimeError: [enforce fail at …\c10\core\CPUAllocator.cpp:75] data.
DefaultCPUAllocator: not enough memory: you tried to allocate
491520000 bytes. Buy new RAM!
I’m really new to pytorch and deep learning in general. Can anyone please tell me what is wrong?
My entire code -
encoded_test_data = tokenized_test_values['input_ids']
encoded_test_masks = tokenized_test_values['attention_mask']
encoded_train_data = torch.from_numpy(encoded_train_data).to(device)
encoded_masks = torch.from_numpy(encoded_masks).to(device)
encoded_test_data = torch.from_numpy(encoded_test_data).to(device)
encoded_test_masks = torch.from_numpy(encoded_test_masks).to(device)
config = EncoderDecoderConfig.from_encoder_decoder_configs(BertConfig(), BertConfig())
EnD_model = EncoderDecoderModel.from_pretrained('albert-base-v2', config=config)
feature_extractor = EnD_model.get_encoder()
feature_vector = feature_extractor.forward(input_ids=encoded_train_data, attention_mask=encoded_masks)
feature_test_vector = feature_extractor.forward(input_ids=encoded_test_data, attention_mask=encoded_test_masks)
Also 491520000 bytes is about 490 MB which should not be a problem.
I tried reducing the number of training examples and also the maximum padded input length. The OOM error still occurs even though the requested allocation is now 153 MB, which should easily be manageable.
I have also raised PyCharm's maximum heap size to 2048 MB. I really don't know what to do now…
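One possible explanation is that the failed request is only a single allocation among many: a forward pass over the whole dataset at once has to hold several tensors of that size at the same time, so the total can exceed the RAM even when one tensor fits. A minimal sketch of the arithmetic and of splitting the inputs into batches, assuming float32 (4 bytes per element); the shape and batch size used here are hypothetical, and both helpers are written for illustration only.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for one dense tensor of the given shape (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def batch_slices(num_examples, batch_size):
    """Yield slices that cover num_examples in chunks of at most batch_size."""
    for start in range(0, num_examples, batch_size):
        yield slice(start, min(start + batch_size, num_examples))


# The request from the error message, expressed in MiB.
print(491520000 / 2**20)

# Hypothetical shape: 1000 examples, 128 tokens, hidden size 768.
print(tensor_bytes((1000, 128, 768)))

# Hypothetical batch size of 4 over 10 examples.
print(list(batch_slices(10, 4)))
```

Each slice would index encoded_train_data and encoded_masks before the forward call, so only one batch of activations is alive at a time; wrapping the loop in torch.no_grad() also avoids keeping tensors around for backpropagation.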

sklearn, Keras, DeepStack - ValueError: multi_class must be in ('ovo', 'ovr')

I trained a set of DNNs and I want to use them in a deep ensemble. The code is implemented in TF2, but the package deepstack works with Keras as well. The code looks something like this
from deepstack.base import KerasMember
from deepstack.ensemble import DirichletEnsemble
dirichletEnsemble = DirichletEnsemble(N=2000 * ensemble_size)
for net_idx in range(0, ensemble_size):
    member = KerasMember(name=model_name, keras_model=model,
                         train_batches=(train_images, train_labels), val_batches=(valid_images, valid_labels))
    dirichletEnsemble.add_member(member)
dirichletEnsemble.fit()
where 'model' is essentially a Keras model, thus you need to load one model at each loop (I am using my own implementation). 'ensemble_size' represents the number of DNNs used in the ensemble.
As a result, I get the following error
ValueError: multi_class must be in ('ovo', 'ovr')
which is generated by the sklearn package.
FURTHER DETAILS: deepstack creates a metric
metric = metrics.roc_auc_score
and then returns it as
return metric(y_t, y_p)
which then calls sklearn
if multi_class == 'raise':
    raise ValueError("multi_class must be in ('ovo', 'ovr')")
In my specific case, the labels are respectively y_t
[ 7 10 18 52 10 13 10 4 7 7 24 26 7 26 13 13]
and y_p
[ 73 250 250 250 281 281 250 281 281 174 281 250 281 250 250 250]
How do I set multi_class as 'ovo' or 'ovr'?
The documentation for roc_auc_score indicates the following:
roc_auc_score(
y_true,
y_score,
*,
average='macro',
sample_weight=None,
max_fpr=None,
multi_class='raise',
labels=None
)
The second last parameter there is multi_class, which has the following explanation:
Multiclass only. Determines the type of configuration to use. The default value raises an error, so either 'ovr' or 'ovo' must be passed explicitly.
So, it seems that there is some variation in how roc auc is calculated and they are forcing you to explicitly choose which variation you want them to use. If you don't make the choice, the default will result in an exception being raised. And that exception is the error that you are reporting in your question title.
If you are getting this error while using sklearn's roc_auc_score, try roc_auc_score(YTEST, YPRED, multi_class='ovr'). 'ovr' stands for one-vs-rest, which scores each class against all the others as a set of binary problems and averages the results.
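To make the one-vs-rest idea concrete, here is a minimal pure-Python sketch of a macro-averaged OvR AUC. It is not sklearn's implementation, and the labels and scores are made up for illustration. Note that it expects one score per class for every sample (for example, predicted probabilities), which is also what roc_auc_score requires in the multiclass case; predicted class indices like the y_p shown in the question would not work as scores.

```python
def binary_auc(y_true, scores):
    """AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ovr_macro_auc(y_true, y_score, labels):
    """Average the binary AUC of each class against all the other classes."""
    aucs = []
    for col, label in enumerate(labels):
        binary_truth = [1 if t == label else 0 for t in y_true]
        column = [row[col] for row in y_score]
        aucs.append(binary_auc(binary_truth, column))
    return sum(aucs) / len(aucs)


# Made-up data: 4 samples, 3 classes, one row of class scores per sample.
y_true = [0, 1, 2, 2]
y_score = [
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.3, 0.2],
]
print(ovr_macro_auc(y_true, y_score, labels=[0, 1, 2]))
```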

How to generate tflite from saved model?

I want to create an object-detection app based on an ssd_mobilenet model that I retrained by following a YouTube tutorial.
I chose the model ssd_mobilenet_v2_coco from the Tensorflow Model Zoo. After the retraining process I've got the model with the following structure:
- saved_model
- variables (empty folder)
- saved_model.pb
- checkpoint
- frozen_inference_graph.pb
- model.ckpt.data-00000-of-00001
- model.ckpt.index
- model.ckpt.meta
- pipeline.config
In the same folder, I have the python script with the following code:
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model", input_shapes={"image_tensor":[1,300,300,3]})
tflite_model = converter.convert()
open("converted_model.tflite", "wb").write(tflite_model)
After running this code, I got the following error:
...
2019-05-24 18:46:59.811289: I tensorflow/lite/toco/import_tensorflow.cc:1324] Converting unsupported operation: TensorArrayGatherV3
2019-05-24 18:46:59.811864: I tensorflow/lite/toco/import_tensorflow.cc:1373] Unable to determine output type for op: TensorArrayGatherV3
2019-05-24 18:46:59.908207: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] Before Removing unused ops: 1792 operators, 3033 arrays (0 quantized)
2019-05-24 18:47:00.089034: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] After Removing unused ops pass 1: 1771 operators, 2979 arrays (0 quantized)
2019-05-24 18:47:00.314681: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] Before general graph transformations: 1771 operators, 2979 arrays (0 quantized)
2019-05-24 18:47:00.453570: F tensorflow/lite/toco/graph_transformations/resolve_constant_slice.cc:59] Check failed: dim_size >= 1 (0 vs. 1)
Is there any solution for the "Check failed: dim_size >= 1 (0 vs. 1)"?
Conversion of MobileNet SSD is a little different due to some Custom ops that are needed in the graph.
Take a look at this Medium post for the end-to-end process of training and exporting the model as a TFLite graph. For conversion, you would need to use the export_tflite_ssd_graph script.

Training model with CreateML MLTextClassifier, stopped by EXC_BAD_ACCESS (code=1, address=0x0)

I'm trying to train my own NLP model with CreateML in an Xcode playground, following Apple's tutorial: https://developer.apple.com/documentation/createml/creating_a_text_classifier_model
but the program is terminated by EXC_BAD_ACCESS (code=1, address=0x0).
Some solutions I found online state that this happens when a pointer is NULL at the time the variable is accessed.
import Foundation
import CreateML
let source = "icecream"
let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "/path/to/\(source).csv"))
let (trainingData, testingData) = data.randomSplit(by: 0.8, seed: 0)
// program stopped here
let sentimentClassifier = try MLTextClassifier(trainingData: trainingData, textColumn: "text", labelColumn: "sentiment")
// error
error: Execution was interrupted, reason: EXC_BAD_ACCESS (code=1, address=0x0).
// output
Finished parsing file /path/to/icecream.csv
Parsing completed. Parsed 100 lines in 0.03412 secs.
Finished parsing file /path/to/icecream.csv
Parsing completed. Parsed 188 lines in 0.008235 secs.
Automatically generating validation set from 10% of the data.
Tokenizing data and extracting features
Starting MaxEnt training with 146 samples
Iteration 1 training accuracy 0.650685
Iteration 2 training accuracy 0.869863
Iteration 3 training accuracy 0.945205
Iteration 4 training accuracy 0.986301
Iteration 5 training accuracy 0.993151
Finished MaxEnt training in 0.04 seconds

Resources