Inception v3: train, test, and validation splits in retrain.py

I have been using TensorFlow's retrain.py script for the Inception v3 model to build a multilabel classification model. I found some code in the script that I don't fully understand, and I'm asking for help clarifying what it does:
for file_name in file_list:
    base_name = os.path.basename(file_name)
    # We want to ignore anything after '_nohash_' in the file name when
    # deciding which set to put an image in, the data set creator has a way of
    # grouping photos that are close variations of each other. For example
    # this is used in the plant disease data set to group multiple pictures of
    # the same leaf.
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    # This looks a bit magical, but we need to decide whether this file should
    # go into the training, testing, or validation sets, and we want to keep
    # existing files in the same set even if more files are subsequently
    # added.
    # To do that, we need a stable way of deciding based on just the file name
    # itself, so we do a hash of that and then use that to generate a
    # probability value that we use to assign it.
    hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
    percentage_hash = ((int(hash_name_hashed, 16) %
                        (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_IMAGES_PER_CLASS))
    if percentage_hash < validation_percentage:
        validation_images.append(base_name)
    elif percentage_hash < (testing_percentage + validation_percentage):
        testing_images.append(base_name)
    else:
        training_images.append(base_name)
Is there a way to name the images to decide which ones go to the training, validation, or test set? If someone has a link to the plant disease dataset mentioned here, that would also be helpful. Thanks!
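For what it's worth, the _nohash_ convention in the snippet is itself a naming mechanism: everything from _nohash_ onward is ignored when hashing, so files that share the prefix before it always land in the same set. Below is a standalone sketch of the same assignment logic, using plain str.encode in place of TensorFlow's compat.as_bytes (MAX_NUM_IMAGES_PER_CLASS is the constant retrain.py defines as 2 ** 27 - 1):
import hashlib
import re

MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1  # the value defined in retrain.py

def which_set(file_name, validation_percentage=10, testing_percentage=10):
    """Stable, name-based assignment of one file to a split."""
    # Strip the '_nohash_' suffix so close variants hash identically
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    hash_name_hashed = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
    percentage_hash = ((int(hash_name_hashed, 16) %
                        (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_IMAGES_PER_CLASS))
    if percentage_hash < validation_percentage:
        return 'validation'
    if percentage_hash < testing_percentage + validation_percentage:
        return 'testing'
    return 'training'

# Variants sharing the prefix before '_nohash_' always get the same answer:
print(which_set('leaf_42_nohash_0.jpg'))  # e.g. 'training'
print(which_set('leaf_42_nohash_1.jpg'))  # always the same set as above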

Related

How to integrate Dedupe active learning functionality (console_label) with a RESTful API

Below is code to train a dedupe model from a CSV, with active learning involved. How can I integrate the console_label function with a RESTful API (e.g. FastAPI)?
# Create a new deduper object and pass our data model to it.
deduper = dedupe.Dedupe(fields)

# If we have training data saved from a previous run of dedupe,
# look for it and load it in.
# __Note:__ if you want to train from scratch, delete the training_file
if os.path.exists(training_file):
    print('reading labeled examples from ', training_file)
    with open(training_file, 'rb') as f:
        deduper.prepare_training(data_d, f)
else:
    deduper.prepare_training(data_d)

# ## Active learning
# Dedupe will find the next pair of records
# it is least certain about and ask you to label them as duplicates
# or not.
# use 'y', 'n' and 'u' keys to flag duplicates
# press 'f' when you are finished
print('starting active labeling...')
dedupe.console_label(deduper)

# Using the examples we just labeled, train the deduper and learn
# blocking predicates
deduper.train()
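One direction, as a hypothetical sketch rather than the library's documented recipe: console_label() is a convenience loop over deduper.uncertain_pairs() and deduper.mark_pairs(), so the same loop can be driven through HTTP endpoints instead. The endpoint paths and the in-memory pending store below are made up for illustration, and deduper is assumed to be prepared as in the snippet above:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
pending = {}   # pair_id -> record pair waiting for a human label
next_id = 0

class LabelIn(BaseModel):
    pair_id: int
    is_duplicate: bool

@app.get("/pair")
def get_pair():
    # uncertain_pairs() returns the record pair(s) dedupe most wants labeled;
    # this is what console_label() asks about interactively.
    global next_id
    pair = deduper.uncertain_pairs()[0]
    pending[next_id] = pair
    next_id += 1
    # assumes the records are JSON-serializable dicts
    return {"pair_id": next_id - 1, "records": pair}

@app.post("/label")
def post_label(body: LabelIn):
    # mark_pairs() feeds the decision back into the active learner,
    # just like pressing 'y'/'n' in the console.
    pair = pending.pop(body.pair_id)
    labeled = {"match": [], "distinct": []}
    labeled["match" if body.is_duplicate else "distinct"].append(pair)
    deduper.mark_pairs(labeled)
    return {"status": "ok"}
Once enough pairs have been labeled this way, call deduper.train() exactly as the original script does.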

How to download pipeline code from Pycaret AutoML into .py files?

PyCaret seems like a great AutoML tool. It works fast and simply, and I would like to download the generated pipeline code into .py files to double-check it and, if needed, customize some parts. Unfortunately, I don't know how to do this. Reading the documentation has not helped. Is it possible or not?
It is not possible to get the underlying code, since PyCaret takes care of this for you. But it is up to you as the user to decide the steps that you want your flow to take, e.g.:
# Setup experiment with user-defined options for preprocessing, etc.
setup(...)

# Create a model (uses training split only)
model = create_model("lr")

# Tune hyperparameters (user can pass a custom tuning grid if needed)
# Again, uses training split only
tuned = tune_model(model, ...)

# Finalize model (so that the best hyperparameters are retrained on the entire dataset)
final = finalize_model(tuned)

# Any other steps you would like to do.
...
Finally, you can save the entire pipeline as a pkl file for later use:
# Saves the model + pipeline as a pkl file
save_model(final, "my_best_model")
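To reuse it later, the saved pipeline can be loaded back and applied to new data; a minimal sketch, where new_data stands in for whatever DataFrame you want to score:
from pycaret.classification import load_model, predict_model

# load_model reads the pkl written by save_model (the .pkl extension is added automatically)
pipeline = load_model("my_best_model")

# predict_model runs the full preprocessing + model pipeline on unseen data
predictions = predict_model(pipeline, data=new_data)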
You may get a partial (incomplete) answer with get_config("prep_pipe") in 2.6.10 or in 3.0.0rc1.
Just run setup() as in the examples, store the result as e.g. cdf1, and try cdf1.pipeline; you should get output like Pipeline(...).
When working with pycaret=3.0.0rc4, you have two options.
Option 1:
get_config("pipeline")
Option 2:
lb = get_leaderboard()
lb.iloc[0]['Model']
Option 1 will give you the transformations done to the data whilst option 2 will give you the same plus the model and its parameters.
Here's some sample code (from a notebook, based on their Binary Classification Tutorial (CLF101) - Level Beginner):
from pycaret.datasets import get_data
from pycaret.classification import *
dataset = get_data('credit')
data = dataset.sample(frac=0.95, random_state=786).reset_index(drop=True)
data_unseen = dataset.drop(data.index).reset_index(drop=True)
exp_clf101 = setup(data = data, target = 'default', session_id=123)
best = compare_models()
evaluate_model(best)
# OPTION 1
get_config("pipeline")
# OPTION 2
lb = get_leaderboard()
lb.iloc[0]['Model']

Loading saved NER back into HuggingFace pipeline?

I am doing some research into HuggingFace's functionalities for transfer learning (specifically, for named entity recognition). To preface, I am a bit new to transformer architectures. I briefly walked through the example from their website:
from transformers import pipeline

nlp = pipeline("ner")
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
           "close to the Manhattan Bridge which is visible from the window."
print(nlp(sequence))
What I would like to do is save and run this locally without having to download the "ner" model every time (it is over 1 GB in size). In their documentation, I see that you can save the pipeline to a local folder using the pipeline.save_pretrained() function. The result is various files, which I am storing in a specific folder.
My question is: how can I load this model back into a script to continue classifying as in the example above, given that save_pretrained() produces multiple files?
Here is what I have tried so far:
1: Following the documentation about pipeline
pipe = transformers.TokenClassificationPipeline(model="pytorch_model.bin", tokenizer='tokenizer_config.json')
The error I got was: 'str' object has no attribute 'config'
2: Following the HuggingFace example on NER:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("path to folder following .save_pretrained()")
tokenizer = AutoTokenizer.from_pretrained("path to folder following .save_pretrained()")

label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
           "close to the Manhattan Bridge."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)
print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
This yields an error: list index out of range
I also tried printing out just predictions, but it does not return the tokens' text along with their entities.
Any help would be much appreciated!
Loading a model like this has always worked for me:
from transformers import pipeline
pipe = pipeline('token-classification', model=model_folder, tokenizer=model_folder)
Have a look at the pipelines documentation for further examples of how to use pipelines.
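For completeness, a minimal save-then-load round trip might look like this (a sketch; model_folder is just a hypothetical local directory):
from transformers import pipeline

model_folder = "local_ner_model"  # hypothetical local directory

# Download once and write model, tokenizer and config to disk
nlp = pipeline("ner")
nlp.save_pretrained(model_folder)

# Later runs: reload entirely from disk, no download needed
pipe = pipeline("token-classification", model=model_folder, tokenizer=model_folder)
print(pipe("Hugging Face Inc. is a company based in New York City."))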

Fastbert: BertDataBunch error for multilabel text classification

I'm following the FastBert tutorial from huggingface https://medium.com/huggingface/introducing-fastbert-a-simple-deep-learning-library-for-bert-models-89ff763ad384
The problem is that the code is not exactly reproducible. The main issue I'm facing is dataset preparation. The tutorial uses this dataset: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data
But if I set up the folder structure according to the tutorial and place the dataset files in the folders, I get errors with the databunch.
databunch = BertDataBunch(args['data_dir'], LABEL_PATH, args.model_name,
                          train_file='train.csv', val_file='val.csv',
                          test_data='test.csv',
                          text_col="comment_text", label_col=label_cols,
                          batch_size_per_gpu=args['train_batch_size'],
                          max_seq_length=args['max_seq_length'],
                          multi_gpu=args.multi_gpu, multi_label=True,
                          model_type=args.model_type)
It complains about the file format being wrong. How should I format the dataset and labels for this dataset with FastBert?
First of all, you can use the notebook from GitHub for FastBert.
https://github.com/kaushaltrivedi/fast-bert/blob/master/sample_notebooks/new-toxic-multilabel.ipynb
There is a small tutorial in the FastBert README on how to process the dataset before using it.
Create a DataBunch object
The databunch object takes training, validation and test csv files and converts the data into internal representation for BERT, RoBERTa, DistilBERT or XLNet. The object also instantiates the correct data-loaders based on device profile and batch_size and max_sequence_length.
from fast_bert.data_cls import BertDataBunch

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col='label',
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=False,
                          model_type='bert')
File format for train.csv and val.csv
index | text | label
0 | Looking through the other comments, I'm amazed that there aren't any warnings to potential viewers of what they have to look forward to when renting this garbage. First off, I rented this thing with the understanding that it was a competently rendered Indiana Jones knock-off. | neg
1 | I've watched the first 17 episodes and this series is simply amazing! I haven't been this interested in an anime series since Neon Genesis Evangelion. This series is actually based off an h-game, which I'm not sure if it's been done before or not, I haven't played the game, but from what I've heard it follows it very well | pos
2 | his movie is nothing short of a dark, gritty masterpiece. I may be bias, as the Apartheid era is an area I've always felt for. | pos
In case the column names are different from the usual text and label, you will have to provide those names via the databunch's text_col and label_col parameters.
labels.csv will contain a list of all unique labels. In this case the file will contain:
pos
neg
For multi-label classification, labels.csv will contain all possible labels:
toxic
severe_toxic
obscene
threat
insult
identity_hate
The file train.csv will then contain one column for each label, with each column value being either 0 or 1. Don't forget to set multi_label=True for multi-label classification in BertDataBunch.
id | text | toxic | severe_toxic | obscene | threat | insult | identity_hate
0 | Why the edits made under my username Hardcore Metallica Fan were reverted? | 0 | 0 | 0 | 0 | 0 | 0
0 | I will mess you up | 1 | 0 | 0 | 1 | 0 | 0
label_col will be a list of label column names. In this case it will be:
['toxic','severe_toxic','obscene','threat','insult','identity_hate']
So, just keep the train.csv, val.csv (just make a copy of train.csv), and test.csv inside data/
In the labels folder, keep a labels.csv with the following contents.
toxic
severe_toxic
obscene
threat
insult
identity_hate
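Putting the pieces together for the toxic-comment data, the multi-label databunch call would then look roughly like this (a sketch assuming the data/ and labels folders described above; adjust paths, batch size and model type to your setup):
from fast_bert.data_cls import BertDataBunch

databunch = BertDataBunch('data/', 'labels/',
                          tokenizer='bert-base-uncased',
                          train_file='train.csv',
                          val_file='val.csv',
                          test_data='test.csv',
                          label_file='labels.csv',
                          text_col='comment_text',
                          label_col=['toxic', 'severe_toxic', 'obscene',
                                     'threat', 'insult', 'identity_hate'],
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=False,
                          multi_label=True,
                          model_type='bert')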

How to load video (image sequences) to become data input for CNN-LSTM deep learning model?

In my project, I have two classes of data, "abnormal" and "normal", with separate training and validation folders for each. Each video folder contains the image frames of its clip, and different video folders contain different numbers of frames. I need to load the frames into the model as the data input and label each sequence with its class. How can I do that with the Keras API? The structure is as follows:
train
------abnormal
--------------video1
---------------------image1
---------------------image2
---------------------image3
---------------------image...
--------------video2
--------------video3
--------------video...
------normal
(the same as above)
You can put all the images into a single folder and rename them, e.g. 1.normal.jpg, 1.abnormal.jpg. This way you reduce the time taken to open the folder of a particular class and read the images.
Once you have renamed them, you can derive the labels using the given method:
def label(img):
    # The class name is the second-to-last dot-separated token,
    # e.g. '1.normal.jpg' -> 'normal'
    word_label = img.split('.')[-2]
    if word_label == 'normal':
        return 1
    elif word_label == 'abnormal':
        return 0
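If you would rather keep the original train/validation folder layout, here is a minimal sketch of loading each video folder into a fixed-length frame array suitable as CNN-LSTM input. Assumptions (not from the question): every clip has at least num_frames frames, frames are resized to 64x64, and normal/abnormal map to 1/0 as in the method above:
import os
import numpy as np
from tensorflow.keras.preprocessing.image import img_to_array, load_img

def load_clip(video_dir, num_frames=16, size=(64, 64)):
    """Read the first num_frames frames of one video folder into an array."""
    frames = []
    for fname in sorted(os.listdir(video_dir))[:num_frames]:
        img = load_img(os.path.join(video_dir, fname), target_size=size)
        frames.append(img_to_array(img) / 255.0)
    return np.stack(frames)            # shape: (num_frames, 64, 64, 3)

def load_split(split_dir):
    """Build (clips, labels) arrays from e.g. 'train/' with its class subfolders."""
    clips, labels = [], []
    for class_name, class_label in [('normal', 1), ('abnormal', 0)]:
        class_dir = os.path.join(split_dir, class_name)
        for video in sorted(os.listdir(class_dir)):
            clips.append(load_clip(os.path.join(class_dir, video)))
            labels.append(class_label)
    # X: (num_videos, num_frames, 64, 64, 3), ready for TimeDistributed(CNN) + LSTM
    return np.stack(clips), np.array(labels)

X_train, y_train = load_split('train')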
