Tensorboard hangs on Pytorch Profile - pytorch

I'm trying to view the results from my torch.profiler.profile.
Here's my code snippet (minus the entire network that I was profiling in a train loop...)
with SummaryWriter(tb_dir) as writer, open(logfn, "wt", encoding="utf-8") as logfp, \
        torch.profiler.profile(
            activities=[
                torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA,
            ],
            schedule=torch.profiler.schedule(
                wait=1,
                warmup=1,
                active=2
            ),
            on_trace_ready=torch.profiler.tensorboard_trace_handler(profile_dir, worker_name="hmm")
        ) as profiler:
    while (args.max_epochs is None or epoch < args.max_epochs) and (args.max_time is None or total_elapsed < args.max_time):
        train(model, epoch, train_dataset, ema_node_loss, opt, sched, crit)  # This is a function defined elsewhere that just does the training loop.
        # a whole bunch of other stuff deleted for brevity
        profiler.step()
And, it does write into my profile_dir these rather ginormous json files:
-rw-rw-r-- 1 binesh binesh 10049360679 Jun 20 19:40 hmm.1655768373559.pt.trace.json
-rw-rw-r-- 1 binesh binesh 10094463514 Jun 20 20:54 hmm.1655772846630.pt.trace.json
Then, after installing the tensorboard plugin, I go to http://localhost:6006/ and I see it just constantly "loading"... I've left it running for more than an hour to no avail.
Can someone please tell me what I'm doing wrong?
Thanks.
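For readers unsure what the schedule in the snippet does, here is a minimal plain-Python sketch of how a wait/warmup/active schedule maps step indices to phases. It does not call torch; the names Phase and phase_for_step are made up for illustration, and the cycling behaviour (the pattern repeating when no repeat limit is given) is an assumption based on my reading of the profiler documentation.

```python
from enum import Enum


class Phase(Enum):
    NONE = "none"                        # profiler idle
    WARMUP = "warmup"                    # profiler running, results discarded
    RECORD = "record"                    # events are collected
    RECORD_AND_SAVE = "record_and_save"  # last active step of a cycle


def phase_for_step(step, wait=1, warmup=1, active=2):
    """Map a 0-based step index to a phase.

    Assumption: with no repeat limit, the wait/warmup/active cycle
    starts over after every (wait + warmup + active) steps.
    """
    cycle = wait + warmup + active
    pos = step % cycle
    if pos < wait:
        return Phase.NONE
    if pos < wait + warmup:
        return Phase.WARMUP
    if pos < cycle - 1:
        return Phase.RECORD
    return Phase.RECORD_AND_SAVE


# Print the phase of the first two cycles for wait=1, warmup=1, active=2.
for step in range(8):
    print(step, phase_for_step(step).value)
```

In the posted loop, profiler.step() runs once per pass of the outer while loop, so one "step" covers one call to train(). If train() runs a full epoch, the two active steps would span two full epochs, which might account for the 10 GB traces. That is a guess; the post does not confirm it.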

Related

PyTorch Lightning doesn't start fit in local environment

I'm studying NLP with PyTorch.
I have a problem in my local environment.
The same code works well in Google Colab, but it doesn't work on my local machine.
This is the code that doesn't work, and its output.
I can't find anyone reporting the same problem.
The code uses all of the DRAM but doesn't show an OOM error.
trainer.fit(
    task,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader
)
Error Message:
Missing logger folder: d:\lightning_logs
C:\Users\Leecarry\AppData\Roaming\Python\Python310\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:608: UserWarning: Checkpoint directory D:\ exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
C:\Users\Leecarry\AppData\Roaming\Python\Python310\site-packages\pytorch_lightning\core\optimizer.py:380: RuntimeWarning: Found unsupported keys in the optimizer configuration: {'scheduler'}
rank_zero_warn(
| Name | Type | Params
--------------------------------------------------------
0 | model | BertForSequenceClassification | 108 M
--------------------------------------------------------
108 M Trainable params
0 Non-trainable params
108 M Total params
435.680 Total estimated model params size (MB)
Epoch 0: 0%| | 0/6251 [00:00<?, ?it/s]
This image shows the same code running in Google Colab.
Please tell me why fit doesn't start, and any ideas for fixing it.
I'm sorry for my English.
Thank you for reading.

Clarifications on training job parameters with Tensorflow

I'm using the new TensorFlow Object Detection API.
I need to replicate the training parameters used in a paper, but I'm a bit confused.
The paper states:
When training neural network models, their base configuration is similar to that used to
train on the COCO 2017 dataset. For the unambiguous comparison of the selected models, the total number of
training steps was set to 100 equal to 100′000 iterations of learning.
Inside model_main_tf2.py, which is the script used to start the training, I can read the following:
"""Creates and runs TF2 object detection models.
For local training/evaluation run:
PIPELINE_CONFIG_PATH=path/to/pipeline.config
MODEL_DIR=/tmp/model_outputs
NUM_TRAIN_STEPS=10000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python model_main_tf2.py -- \
--model_dir=$MODEL_DIR --num_train_steps=$NUM_TRAIN_STEPS \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--pipeline_config_path=$PIPELINE_CONFIG_PATH \
--alsologtostderr
"""
Also, you can specify the num_steps and total_steps parameters in the pipeline.config file (used by the training script):
train_config: {
  batch_size: 1
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  num_steps: 50000
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        cosine_decay_learning_rate {
          learning_rate_base: .16
          total_steps: 50000
          warmup_learning_rate: 0
          warmup_steps: 2500
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
So, what I don't understand is how to map what is written in the paper onto the TensorFlow parameters.
What are num_steps and total_steps inside the pipeline.config file?
And what is the NUM_TRAIN_STEPS argument?
Does it override the steps in the config file, or is it a completely different thing?
If more details are needed feel free to ask.
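The following does not settle whether NUM_TRAIN_STEPS overrides num_steps. It is only a generic sketch of a cosine decay with linear warmup, using the values from the config above, to show what total_steps and warmup_steps control in such a schedule. The function name and the clamping after total_steps are assumptions, not the Object Detection API's code.

```python
import math


def cosine_decay_with_warmup(step, learning_rate_base=0.16, total_steps=50000,
                             warmup_learning_rate=0.0, warmup_steps=2500):
    """Learning rate at a given global step (generic sketch)."""
    if step < warmup_steps:
        # Linear ramp from warmup_learning_rate up to learning_rate_base.
        slope = (learning_rate_base - warmup_learning_rate) / warmup_steps
        return warmup_learning_rate + slope * step
    # Fraction of the decay phase completed; clamped at 1.0 (assumption)
    # so the rate stays at its final value once total_steps is reached.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return 0.5 * learning_rate_base * (1.0 + math.cos(math.pi * progress))


# Learning rate at a few points along the schedule.
for step in (0, 1250, 2500, 26250, 50000):
    print(step, cosine_decay_with_warmup(step))
```

In a schedule of this shape, the decay is stretched over total_steps. If training runs for fewer steps than total_steps, the rate never reaches its minimum; if it runs for more, the remaining steps sit at the final value. This is the usual reason for keeping total_steps equal to the number of training steps.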

sklearn, Keras, DeepStack - ValueError: multi_class must be in ('ovo', 'ovr')

I trained a set of DNNs and I want to use them in a deep ensemble. The code is implemented in TF2, but the package deepstack works with Keras as well. The code looks something like this
from deepstack.base import KerasMember
from deepstack.ensemble import DirichletEnsemble
dirichletEnsemble = DirichletEnsemble(N=2000 * ensemble_size)
for net_idx in range(0, ensemble_size):
    member = KerasMember(name=model_name, keras_model=model,
                         train_batches=(train_images, train_labels), val_batches=(valid_images, valid_labels))
    dirichletEnsemble.add_member(member)
dirichletEnsemble.fit()
where 'model' is essentially a Keras model, thus you need to load one model at each loop (I am using my own implementation). 'ensemble_size' represents the number of DNNs used in the ensemble.
As a result, I get the following error
ValueError: multi_class must be in ('ovo', 'ovr')
which is generated by the sklearn package.
FURTHER DETAILS: deepstack creates a metric
metric = metrics.roc_auc_score
and then returns it as
return metric(y_t, y_p)
which then calls sklearn
if multi_class == 'raise':
    raise ValueError("multi_class must be in ('ovo', 'ovr')")
In my specific case, the labels are respectively y_t
[ 7 10 18 52 10 13 10 4 7 7 24 26 7 26 13 13]
and y_p
[ 73 250 250 250 281 281 250 281 281 174 281 250 281 250 250 250]
How do I set multi_class as 'ovo' or 'ovr'?
The documentation for roc_auc_score indicates the following:
roc_auc_score(
    y_true,
    y_score,
    *,
    average='macro',
    sample_weight=None,
    max_fpr=None,
    multi_class='raise',
    labels=None
)
The second last parameter there is multi_class, which has the following explanation:
Multiclass only. Determines the type of configuration to use. The default value raises an error, so either 'ovr' or 'ovo' must be passed explicitly.
So, it seems that there is some variation in how roc auc is calculated and they are forcing you to explicitly choose which variation you want them to use. If you don't make the choice, the default will result in an exception being raised. And that exception is the error that you are reporting in your question title.
If you are getting this error while using sklearn's roc_auc_score, try roc_auc_score(YTEST, YPRED, multi_class='ovr'). 'ovr' is one-vs-rest, which treats your multiclass problem as a set of binary problems, one per class.
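To make the one-vs-rest idea concrete, here is a small standard-library sketch of a macro-averaged one-vs-rest AUC. It is not scikit-learn's implementation. The helper names and toy data are invented for illustration, and the AUC is computed as the fraction of positive/negative pairs ranked correctly, with ties counting half.

```python
def binary_auc(labels, scores):
    """AUC for 0/1 labels: chance that a positive outscores a negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def ovr_macro_auc(y_true, y_score, classes):
    """Average the binary AUC of each class against all other classes."""
    aucs = []
    for k, cls in enumerate(classes):
        binary_labels = [1 if y == cls else 0 for y in y_true]
        class_scores = [row[k] for row in y_score]  # column k = score for cls
        aucs.append(binary_auc(binary_labels, class_scores))
    return sum(aucs) / len(aucs)


# Toy data: 6 samples, 3 classes, one row of class probabilities per sample.
y_true = [0, 1, 2, 2, 1, 0]
y_score = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.3, 0.3, 0.4],
    [0.5, 0.25, 0.25],
    [0.6, 0.3, 0.1],
]
print(ovr_macro_auc(y_true, y_score, classes=[0, 1, 2]))
```

Note that the scores here are per-class probabilities, one column per class. As far as I know, scikit-learn expects the same shape for multiclass y_score, so predicted class indices like the y_p shown in the question would likely need to be replaced by the model's probability outputs.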

Training model with CreateML MLTextClassifier, stopped by EXC_BAD_ACCESS (code=1, address=0x0)

I'm trying to train my own NLP model with CreateML in an Xcode playground, following Apple's tutorial: https://developer.apple.com/documentation/createml/creating_a_text_classifier_model
but the program terminates with EXC_BAD_ACCESS (code=1, address=0x0).
Solutions I found on the Internet state that this happens when a pointer is NULL at the time the variable is accessed.
import Foundation
import CreateML
let source = "icecream"
let data = try MLDataTable(contentsOf: URL(fileURLWithPath: "/path/to/\(source).csv"))
let (trainingData, testingData) = data.randomSplit(by: 0.8, seed: 0)
// program stopped here
let sentimentClassifier = try MLTextClassifier(trainingData: trainingData, textColumn: "text", labelColumn: "sentiment")
// error
error: Execution was interrupted, reason: EXC_BAD_ACCESS (code=1, address=0x0).
// output
Finished parsing file /path/to/icecream.csv
Parsing completed. Parsed 100 lines in 0.03412 secs.
Finished parsing file /path/to/icecream.csv
Parsing completed. Parsed 188 lines in 0.008235 secs.
Automatically generating validation set from 10% of the data.
Tokenizing data and extracting features
Starting MaxEnt training with 146 samples
Iteration 1 training accuracy 0.650685
Iteration 2 training accuracy 0.869863
Iteration 3 training accuracy 0.945205
Iteration 4 training accuracy 0.986301
Iteration 5 training accuracy 0.993151
Finished MaxEnt training in 0.04 seconds

OpenCV Error: Assertion failed (_img.rows * _img.cols == vecSize)

I keep getting this error
OpenCV Error: Assertion failed (_img.rows * _img.cols == vecSize) in get, file /build/opencv-SviWsf/opencv-2.4.9.1+dfsg/apps/traincascade/imagestorage.cpp, line 157
terminate called after throwing an instance of 'cv::Exception'
what(): /build/opencv-SviWsf/opencv-2.4.9.1+dfsg/apps/traincascade/imagestorage.cpp:157: error: (-215) _img.rows * _img.cols == vecSize in function get
Aborted (core dumped)
when running opencv_traincascade. I run with these arguments: opencv_traincascade -data data -vec positives.vec -bg bg.txt -numPos 1600 -numNeg 800 -numStages 10 -w 20 -h 20.
My project build is as follows:
workspace
|__bg.txt
|__data/ # where I plan to put cascade
|__info/
|__ # all samples
|__info.lst
|__jersey5050.jpg
|__neg/
|__ # neg images
|__opencv/
|__positives.vec
before I ran opencv_createsamples -img jersey5050.jpg -bg bg.txt -info info/info.lst -maxxangle 0.5 - maxyangle 0.5 -maxzangle 0.5 -num 1800
Not quite sure why I'm getting this error. The images are all converted to greyscale as well. The negs are sized at 100x100 and jersey5050.jpg is sized at 50x50. I saw someone had the same error on the OpenCV forums, and someone suggested deleting the backup .xml files that OpenCV creates in case the training is "interrupted". I deleted those and nothing changed. Please help! I'm using Python 3 on Mac. I'm also running these commands on an Ubuntu server from DigitalOcean with 2GB of RAM, but I don't think that's part of the problem.
EDIT
Forgot to mention: after the opencv_createsamples command, I then ran opencv_createsamples -info info/info.lst -num 1800 -w 20 -h20 -vec positives.vec
I solved it haha. Even though I specified in the command the width and height to be 20x20, it changed it to 20x24. So the opencv_traincascade command was throwing an error. Once I changed the width and height arguments in the opencv_traincascade command it worked.
This error occurs when the parameters passed to opencv_traincascade do not match the vec file that was generated, as the terminal rightly reports in this line:
Assertion failed (_img.rows * _img.cols == vecSize)
opencv_createsamples displays the parameters passed to it when it creates the samples. Please verify that the parameters used for creating the samples are the same as the ones you pass for training. I have attached the terminal log for reference.
mayank#mayank-Aspire-A515-51G:~/programs/opencv/CSS/homework/HAAR_classifier/dataset$ opencv_createsamples -info pos.txt -num 235 -w 40 -h 40 -vec positives_test.vec
Info file name: pos.txt
Img file name: (NULL)
Vec file name: positives_test.vec
BG file name: (NULL)
Num: 235
BG color: 0
BG threshold: 80
Invert: FALSE
Max intensity deviation: 40
Max x angle: 1.1
Max y angle: 1.1
Max z angle: 0.5
Show samples: FALSE
Width: 40 <--- confirm
Height: 40 <--- confirm
Max Scale: -1
RNG Seed: 12345
Create training samples from images collection...
Done. Created 235 samples
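The check behind the assertion can be reproduced before training. The sketch below reads the sample count and vector size from the start of a .vec file and compares the size with width * height. The header layout (two little-endian 32-bit integers followed by two 16-bit fields) is an assumption about the file format, and the function names are made up for illustration.

```python
import struct


def read_vec_header(path):
    """Return (sample_count, vec_size) from the first 12 bytes of a .vec file.

    Assumed layout: int32 count, int32 vector size, then two int16
    fields that are ignored here.
    """
    with open(path, "rb") as fp:
        header = fp.read(12)
    if len(header) < 12:
        raise ValueError("file too short to contain a .vec header")
    count, vec_size, _, _ = struct.unpack("<iihh", header)
    return count, vec_size


def size_matches(vec_size, width, height):
    """Mirror of the failing assertion: rows * cols must equal vecSize."""
    return width * height == vec_size
```

For example, read_vec_header("positives.vec") followed by size_matches(vec_size, 20, 20) would show whether -w 20 -h 20 is consistent with the file. In the question, the second opencv_createsamples command is written with -h20 (no space). If that flag was not parsed, the height may have fallen back to the tool's default, which as far as I know is 24, and that would fit the 20x24 result the asker reports.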
