How can I use a custom tokenizer in the OpenNMT transformer (PyTorch)?

I'm trying to train a transformer for translation with OpenNMT-py,
and I already have a tokenizer trained with sentencepiece (unigram).
But I don't know how to use my custom tokenizer in the training config YAML.
I'm referring to the OpenNMT docs (https://opennmt.net/OpenNMT-py/examples/Translation.html).
Here is my config:
# original_ko_en.yaml
## Where is the vocab(s)
src_vocab: /workspace/tokenizer/t_50k.vocab
tgt_vocab: /workspace/tokenizer/t_50k.vocab
# Corpus opts:
data:
    corpus_1:
        path_src: /storage/genericdata_basemodel/train.ko
        path_tgt: /storage/genericdata_basemodel/train.en
        transforms: [sentencepiece]
        weight: 1
    valid:
        path_src: /storage/genericdata_basemodel/valid.ko
        path_tgt: /storage/genericdata_basemodel/valid.en
        transforms: [sentencepiece]
#### Subword
src_subword_model: /workspace/tokenizer/t_50k.model
tgt_subword_model: /workspace/tokenizer/t_50k.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# filter
# src_seq_length: 200
# tgt_seq_length: 200
# silently ignore empty lines in the data
skip_empty_level: silent
# Train on a single GPU
world_size: 1
gpu_ranks: [0]
# General opts
save_model: /storage/models/opennmt_v1/opennmt
keep_checkpoint: 100
save_checkpoint_steps: 10000
average_decay: 0.0005
seed: 1234
train_steps: 500000
valid_steps: 20000
warmup_steps: 8000
report_every: 1000
# Model
decoder_type: transformer
encoder_type: transformer
layers: 6
heads: 8
word_vec_size: 512
rnn_size: 512
transformer_ff: 2048
dropout: 0.1
label_smoothing: 0.1
# Optimization
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0
normalization: tokens
param_init: 0.0
param_init_glorot: 'true'
position_encoding: 'true'
# Batching
batch_size: 4096
batch_type: tokens
accum_count: 8
max_generator_batches: 2
# Visualization
tensorboard: True
tensorboard_log_dir: /workspace/runs/onmt1
When I run onmt_train -config xxx.yaml, training fails with an int error.
So I have two questions:
1. My sentencepiece vocab values are floats. How can I resolve the int error?
2. When training stops by accident, or I want to train an existing model.pt some more, what is the command to resume training from that model.pt?
I look forward to any opinions.
Thanks.

I got the answers.
We can use tools/spm_to_vocab in OpenNMT to convert the sentencepiece vocab into the format onmt_train expects.
The train_from argument is the one for resuming training.
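For reference, a sketch of both steps (the checkpoint name is hypothetical, and the exact script invocation may vary between OpenNMT-py versions):

# Convert the SentencePiece .vocab (token <tab> float log-prob) into the
# token-count format onmt_train expects, then point src_vocab/tgt_vocab at the result:
python tools/spm_to_vocab.py < /workspace/tokenizer/t_50k.vocab > /workspace/tokenizer/t_50k.onmt.vocab

# Resume training from a saved checkpoint:
onmt_train -config original_ko_en.yaml -train_from /storage/models/opennmt_v1/opennmt_step_10000.pt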

Related

pytorch CycleGAN gives a Missing key error when testing

I have trained a model using the pytorch-CycleGAN-and-pix2pix implementation and would like to test it.
However, when I test it I get this error:
model [CycleGANModel] was created
loading the model from ./checkpoints/cycbw50/latest_net_G_A.pth
Traceback (most recent call last):
  File "test.py", line 47, in <module>
    model.setup(opt)  # regular setup: load and print networks; create schedulers
  File "/media/bitlockermount/SmartImageToDigitalTwin/SmartImageToDigitalTwin/bin/python/cyclegann/pytorch-CycleGAN-and-pix2pix/models/base_model.py", line 88, in setup
    self.load_networks(load_suffix)
  File "/media/bitlockermount/SmartImageToDigitalTwin/SmartImageToDigitalTwin/bin/python/cyclegann/pytorch-CycleGAN-and-pix2pix/models/base_model.py", line 199, in load_networks
    net.load_state_dict(state_dict)
  File "/home/bst/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 846, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ResnetGenerator:
Missing key(s) in state_dict: "model.1.bias", "model.4.bias", "model.7.bias", "model.10.conv_block.1.bias", "model.10.conv_block.5.bias", "model.11.conv_block.1.bias", "model.11.conv_block.5.bias", "model.12.conv_block.1.bias", "model.12.conv_block.5.bias", "model.13.conv_block.1.bias", "model.13.conv_block.5.bias", "model.14.conv_block.1.bias", "model.14.conv_block.5.bias", "model.15.conv_block.1.bias", "model.15.conv_block.5.bias", "model.16.conv_block.1.bias", "model.16.conv_block.5.bias", "model.17.conv_block.1.bias", "model.17.conv_block.5.bias", "model.18.conv_block.1.bias", "model.18.conv_block.5.bias", "model.19.bias", "model.22.bias".
Unexpected key(s) in state_dict: "model.2.weight", "model.2.bias", "model.5.weight", "model.5.bias", "model.8.weight", "model.8.bias", "model.10.conv_block.2.weight", "model.10.conv_block.2.bias", "model.10.conv_block.6.weight", "model.10.conv_block.6.bias", "model.11.conv_block.2.weight", "model.11.conv_block.2.bias", "model.11.conv_block.6.weight", "model.11.conv_block.6.bias", "model.12.conv_block.2.weight", "model.12.conv_block.2.bias", "model.12.conv_block.6.weight", "model.12.conv_block.6.bias", "model.13.conv_block.2.weight", "model.13.conv_block.2.bias", "model.13.conv_block.6.weight", "model.13.conv_block.6.bias", "model.14.conv_block.2.weight", "model.14.conv_block.2.bias", "model.14.conv_block.6.weight", "model.14.conv_block.6.bias", "model.15.conv_block.2.weight", "model.15.conv_block.2.bias", "model.15.conv_block.6.weight", "model.15.conv_block.6.bias", "model.16.conv_block.2.weight", "model.16.conv_block.2.bias", "model.16.conv_block.6.weight", "model.16.conv_block.6.bias", "model.17.conv_block.2.weight", "model.17.conv_block.2.bias", "model.17.conv_block.6.weight", "model.17.conv_block.6.bias", "model.18.conv_block.2.weight", "model.18.conv_block.2.bias", "model.18.conv_block.6.weight", "model.18.conv_block.6.bias", "model.20.weight", "model.20.bias", "model.23.weight", "model.23.bias".
The opt file for the training:
----------------- Options ---------------
batch_size: 1
beta1: 0.5
checkpoints_dir: ./checkpoints
continue_train: False
crop_size: 256
dataroot: ./datasets/datasets/boundedwalls_50_0.1/ [default: None]
dataset_mode: aligned [default: unaligned]
direction: AtoB
display_env: main
display_freq: 400
display_id: 1
display_ncols: 4
display_port: 8097
display_server: http://localhost
display_winsize: 256
epoch: latest
epoch_count: 1
gan_mode: lsgan
gpu_ids: 0
init_gain: 0.02
init_type: normal
input_nc: 3
isTrain: True [default: None]
lambda_A: 10.0
lambda_B: 10.0
lambda_identity: 0.5
load_iter: 0 [default: 0]
load_size: 286
lr: 0.0002
lr_decay_iters: 50
lr_policy: linear
max_dataset_size: inf
model: cycle_gan
n_epochs: 100
n_epochs_decay: 100
n_layers_D: 3
name: cycbw50 [default: experiment_name]
ndf: 64
netD: basic
netG: resnet_9blocks
ngf: 64
no_dropout: True
no_flip: False
no_html: False
norm: batch [default: instance]
num_threads: 4
output_nc: 3
phase: train
pool_size: 50
preprocess: resize_and_crop
print_freq: 100
save_by_iter: False
save_epoch_freq: 5
save_latest_freq: 5000
serial_batches: False
suffix:
update_html_freq: 1000
verbose: False
----------------- End -------------------
And when testing it I used the following testing settings
python3 test.py --dataroot ./datasets/datasets/boundedwalls_50_0.1 --name cycbw50 --model pix2pix --netG resnet_9blocks --direction BtoA --dataset_mode aligned --norm batch --load_size 286
----------------- Options ---------------
aspect_ratio: 1.0
batch_size: 1
checkpoints_dir: ./checkpoints
crop_size: 256
dataroot: ./datasets/datasets/boundedwalls_50_0.1/ [default: None]
dataset_mode: unaligned
direction: AtoB
display_winsize: 256
epoch: latest
eval: False
gpu_ids: 0
init_gain: 0.02
init_type: normal
input_nc: 3
isTrain: False [default: None]
load_iter: 0 [default: 0]
load_size: 256
max_dataset_size: inf
model: cycle_gan [default: test]
n_layers_D: 3
name: cycbw50 [default: experiment_name]
ndf: 64
netD: basic
netG: resnet_9blocks
ngf: 64
no_dropout: True
no_flip: False
norm: instance
num_test: 50
num_threads: 4
output_nc: 3
phase: test
preprocess: resize_and_crop
results_dir: ./results/
serial_batches: False
suffix:
verbose: False
----------------- End -------------------
Does anybody see what I'm doing wrong here? I would like to run this network so that I get results for individual images; so far the test script seems the most promising, but it just crashes on this network.
I think the problem is that in some layers bias is None, but at test time the model expects a bias; you should check the code for details.
Comparing your train and test configs, the norm is different (batch vs. instance). In the GitHub code, the norm type determines whether the bias term is used:
if type(norm_layer) == functools.partial:
    use_bias = norm_layer.func == nn.InstanceNorm2d
else:
    use_bias = norm_layer == nn.InstanceNorm2d

model = [nn.ReflectionPad2d(3),
         nn.Conv2d(input_nc, ngf, kernel_size=7, padding=0, bias=use_bias),
         norm_layer(ngf),
         nn.ReLU(True)]
You can check it in models/networks.py of the repository.
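As a minimal sketch of the effect (assuming only the build logic above, not the repo's own test code):

import functools
import torch.nn as nn

# With batch norm the conv bias is dropped (BatchNorm carries its own affine bias);
# with instance norm the conv keeps its bias (InstanceNorm here has no affine params).
for name, norm_layer in [("batch", functools.partial(nn.BatchNorm2d, affine=True)),
                         ("instance", functools.partial(nn.InstanceNorm2d, affine=False))]:
    use_bias = norm_layer.func == nn.InstanceNorm2d
    conv = nn.Conv2d(3, 64, kernel_size=7, padding=0, bias=use_bias)
    print(name, sorted(conv.state_dict().keys()))
# batch    -> ['weight']
# instance -> ['bias', 'weight']

So a checkpoint trained with norm: batch has no conv bias keys (and extra norm weights), while a test-time model built with norm: instance expects conv biases, which is exactly the missing/unexpected key mismatch above. Testing with the same --norm as training should fix it.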

use feature selection to select the best 2048 features instead of 4096

I'm still a beginner with DL. I'm trying to do image classification using the pre-trained VGG16 model and dump the features into a CSV file, and I got 4096 features, as shown below:
   1       2      3      4       ...  4096
   0.12    0.23   0.345  0.5372  ...  0.21111
   0.2313  0.321  0.214  0.3542  ...  0.46756
   ...
I'm trying to use SelectKBest feature selection to select the best 2048 features instead of 4096. Can you show me how, please?
I have tried:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("multiClassVGG16.csv")
array = data.values
X = array[:, 1:]
Y = array[:, 0]

test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# Summarize scores
np.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# Summarize selected features
print(features[0:2048, :])

# Feature extraction
model = LogisticRegression()
rfe = RFE(model, 2048)
fit = rfe.fit(X, Y)
print("Num Features: %s" % (fit.n_features_))
print("Selected Features: %s" % (fit.support_))
print("Feature Ranking: %s" % (fit.ranking_))
I'm just looking to generate a new dataframe with the best 2048 features and dump it to CSV again.
Desired result:
   1       2      3      4       ...  2048
   0.12    0.23   0.345  0.5372  ...  0.21111
   0.2313  0.321  0.214  0.3542  ...  0.46756
The feature extraction part should be something along the lines of:
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Feature extraction
model = LogisticRegression()
rfe = RFE(model, 2048)
rfe.fit(X, Y)
# Extracting the 2048 selected features
feat = rfe.transform(X)
feat.shape
# (n_rows, 2048)
# Save to CSV
np.savetxt("foo.csv", feat, delimiter=",")
You can make use of the fit_transform() method to combine fitting the feature selector with extracting the selected features.
Go through the documentation to better understand the additional functionality the method offers.
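For example, with the same X and Y as above:

feat = rfe.fit_transform(X, Y)  # fit the selector and extract the 2048 features in one call
np.savetxt("foo.csv", feat, delimiter=",")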

How can we load caffe2 pre-trained model in keras?

I have pre-trained weights for Mask R-CNN in caffe2 (a .pkl file) and its config file as YAML. If I try to load it directly it throws Improper config format: . Is there a way to use it without installing caffe2?
Config.py
MODEL:
  TYPE: generalized_rcnn
  CONV_BODY: FPN.add_fpn_ResNet101_conv5_body
  NUM_CLASSES: 6
  FASTER_RCNN: True
  MASK_ON: True
NUM_GPUS: 8
SOLVER:
  WEIGHT_DECAY: 0.0001
  LR_POLICY: steps_with_decay
  # 1x schedule (note TRAIN.IMS_PER_BATCH: 1)
  BASE_LR: 0.01
  GAMMA: 0.1
  MAX_ITER: 180000
  STEPS: [0, 120000, 160000]
FPN:
  FPN_ON: True
  MULTILEVEL_ROIS: True
  MULTILEVEL_RPN: True
MRCNN:
  ROI_MASK_HEAD: mask_rcnn_heads.mask_rcnn_fcn_head_v1up4convs
  RESOLUTION: 28  # (output mask resolution) default 14
  ROI_XFORM_METHOD: RoIAlign
  ROI_XFORM_RESOLUTION: 14  # default 7
  ROI_XFORM_SAMPLING_RATIO: 2  # default 0
  DILATION: 1  # default 2
  CONV_INIT: MSRAFill  # default GaussianFill
TRAIN:
  # md5sum of weights pkl file: aa14062280226e48f569ef1c7212e7c7
  DATASETS: ('medline_train',)
  SCALES: (400,)
  MAX_SIZE: 512
  IMS_PER_BATCH: 1
  BATCH_SIZE_PER_IM: 512
  RPN_PRE_NMS_TOP_N: 2000  # Per FPN level
  USE_FLIPPED: False
TEST:
  DATASETS: ('medline_val',)
  SCALE: 400
  MAX_SIZE: 512
  NMS: 0.5
  RPN_PRE_NMS_TOP_N: 1000  # Per FPN level
  RPN_POST_NMS_TOP_N: 1000
  FORCE_JSON_DATASET_EVAL: True
OUTPUT_DIR: .
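On reading the .pkl without caffe2: a Detectron weights file is typically just a Python pickle of numpy arrays, so plain pickle is usually enough to inspect it. A minimal sketch, assuming the standard Detectron layout (the filename is hypothetical):

import pickle

with open("model_final.pkl", "rb") as f:
    data = pickle.load(f, encoding="latin1")  # latin1 decodes Python-2-era pickles

blobs = data.get("blobs", data)  # Detectron usually nests the weights under 'blobs'
for name, arr in blobs.items():
    print(name, getattr(arr, "shape", type(arr)))

Mapping those arrays onto Keras layers is still a manual job (naming and layout conventions differ), but this gets the weights out without installing caffe2.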

h2o vs scikit learn confusion matrix

Anyone able to match the sklearn confusion matrix to h2o?
They never match....
Doing something similar with Keras produces a perfect match.
But in h2o they are always off. Tried it every which way...
Borrowed some code from:
Any difference between H2O and Scikit-Learn metrics scoring?
# In[30]:
import pandas as pd
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# Import a sample binary outcome train/test set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
test[y] = test[y].asfactor()
# Train and cross-validate a GBM
model = H2OGradientBoostingEstimator(distribution="bernoulli", seed=1)
model.train(x=x, y=y, training_frame=train)
# In[31]:
# Test AUC
model.model_performance(test).auc()
# 0.7817203808052897
# In[32]:
# Generate predictions on a test set
pred = model.predict(test)
# In[33]:
from sklearn.metrics import roc_auc_score, confusion_matrix
pred_df = pred.as_data_frame()
y_true = test[y].as_data_frame()
roc_auc_score(y_true, pred_df['p1'].tolist())
#pred_df.head()
# In[36]:
y_true = test[y].as_data_frame().values
cm = pd.DataFrame(confusion_matrix(y_true, pred_df['predict'].values))
# In[37]:
print(cm)
      0     1
0  1354   961
1   540  2145
# In[38]:
model.model_performance(test).confusion_matrix()
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.353664307031828:
           0       1      Error   Rate
0        964.0  1351.0  0.5836   (1351.0/2315.0)
1        274.0  2411.0  0.102    (274.0/2685.0)
Total   1238.0  3762.0  0.325    (1625.0/5000.0)
# In[39]:
h2o.cluster().shutdown()
This does the trick, thanks for the hunch Vivek. Still not an exact match, but extremely close.
perf = model.model_performance(train)
threshold = perf.find_threshold_by_max_metric('f1')
model.model_performance(test).confusion_matrix(thresholds=threshold)
I also met the same issue. Here is what I would do to make a fair comparison:
model.train(x=x, y=y, training_frame=train, validation_frame=test)
cm1 = model.confusion_matrix(metrics=['F1'], valid=True)
Since we train the model with both a training and a validation frame, pred['predict'] will use the threshold that maximizes the F1 score on the validation data. To check, one can apply that threshold manually:
perf = model.model_performance(valid=True)
threshold = perf.find_threshold_by_max_metric('f1')
pred_df['predict'] = pred_df['p1'].apply(lambda x: 0 if x < threshold else 1)
To get another confusion matrix from scikit-learn:
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(y_true, pred_df['predict'])
In my case, I don't understand why I still get slightly different results, for example:
print(cm1)
>> [[3063 176]
[ 94 146]]
print(cm2)
>> [[3063 176]
[ 95 145]]

How should we interpret the results of the H2O predict function?

I have trained and stored a random forest binary classification model. Now I'm trying to simulate processing new (out-of-sample) data with this model. My Python (Anaconda 3.6) code is:
import h2o
import pandas as pd
import sys
localH2O = h2o.init(ip = "localhost", port = 54321, max_mem_size = "8G", nthreads = -1)
h2o.remove_all()
model_path = "C:/sm/BottleRockets/rf_model/DRF_model_python_1501621766843_28117";
model = h2o.load_model(model_path)
new_data = h2o.import_file(path="C:/sm/BottleRockets/new_data.csv")
print(new_data.head(10))
predict = model.predict(new_data) # predict returns a data frame
print(predict.describe())
predicted = predict[0,0]
probability = predict[0,2] # probability the prediction is a "1"
print('prediction: ', predicted, ', probability: ', probability)
When I run this code I get:
>>> import h2o
>>> import pandas as pd
>>> import sys
>>> localH2O = h2o.init(ip = "localhost", port = 54321, max_mem_size = "8G", nthreads = -1)
Checking whether there is an H2O instance running at http://localhost:54321. connected.
-------------------------- ------------------------------
H2O cluster uptime: 22 hours 22 mins
H2O cluster version: 3.10.5.4
H2O cluster version age: 18 days
H2O cluster name: H2O_from_python_Charles_0fqq0c
H2O cluster total nodes: 1
H2O cluster free memory: 6.790 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: locked, healthy
H2O connection url: http://localhost:54321
H2O connection proxy:
H2O internal security: False
Python version: 3.6.1 final
-------------------------- ------------------------------
>>> h2o.remove_all()
>>> model_path = "C:/sm/BottleRockets/rf_model/DRF_model_python_1501621766843_28117";
>>> model = h2o.load_model(model_path)
>>> new_data = h2o.import_file(path="C:/sm/BottleRockets/new_data.csv")
Parse progress: |█████████████████████████████████████████████████████████| 100%
>>> print(new_data.head(10))
  BoxRatio    Thrust    Velocity    OnBalRun    vwapGain
----------  --------  ----------  ----------  ----------
     1.502    55.044        0.38          37       0.845
[1 row x 5 columns]
>>> predict = model.predict(new_data) # predict returns a data frame
drf prediction progress: |████████████████████████████████████████████████| 100%
>>> print(predict.describe())
Rows:1
Cols:3
predict p0 p1
------- --------- ------------------ -------------------
type enum real real
mins 0.8849431818181818 0.11505681818181818
mean 0.8849431818181818 0.11505681818181818
maxs 0.8849431818181818 0.11505681818181818
sigma 0.0 0.0
zeros 0 0
missing 0 0 0
0 1 0.8849431818181818 0.11505681818181818
None
>>> predicted = predict[0,0]
>>> probability = predict[0,2] # probability the prediction is a "1"
>>> print('prediction: ', predicted, ', probability: ', probability)
prediction: 1 , probability: 0.11505681818181818
>>>
I am confused by the contents of the "predict" data frame. Please tell me what the numbers in the columns labeled "p0" and "p1" mean. I hope they are probabilities, and as you can see by my code, I am trying to get the predicted classification (0 or 1) and a probability that this classification is correct. Does my code correctly do that?
Any comments will be greatly appreciated.
Charles
p0 is the probability (between 0 and 1) that class 0 is chosen.
p1 is the probability (between 0 and 1) that class 1 is chosen.
The thing to keep in mind is that the "prediction" is made by applying a threshold to p1. That threshold point is chosen depending on whether you want to reduce false positives or false negatives. It's not just 0.5.
The threshold chosen for "the prediction" is max-F1. But you can extract out p1 yourself and threshold it any way you like.
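For example, a minimal sketch using the predict frame above (0.25 is an arbitrary cut-off for illustration, not h2o's chosen threshold):

pred_df = predict.as_data_frame()
# Apply a custom threshold to p1 instead of the default max-F1 label
pred_df["my_label"] = (pred_df["p1"] >= 0.25).astype(int)
print(pred_df[["predict", "p0", "p1", "my_label"]])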
Darren Cook asked me to post the first few lines of my training data. Here it is:
   BoxRatio  Thrust  Velocity  OnBalRun  vwapGain  Altitude
0     0.000   0.000     2.186     4.534     0.361         1
1     0.000   0.000     0.561     2.642     0.909         1
2     2.824   2.824     2.199     4.748     1.422         1
3     0.442   0.452     1.702     3.695     1.186         0
4     0.084   0.088     0.612     1.699     0.700         1
The response column is labeled "Altitude". Class 1 is what I want to see from new "out-of-sample" data. "1" is good, and it means that "Altitude" was reached (true positive). "0" means that "Altitude" was not reached (true negative). In the predict table above, "1" was predicted with a probability of 0.11505681818181818. This does not make sense to me.
Charles
