How to load custom data into tfds for keras cyclegan example? - python-3.x

In the example at https://keras.io/examples/generative/cyclegan/, a pre-existing dataset is loaded for the implementation. I am trying to use my own dataset instead.
import tensorflow_datasets as tfds
data = tfds.folder_dataset.ImageFolder('Images', shape=(256, 256, 3))
ds = data.as_dataset()
where 'Images' is the root folder containing two subfolders, train and test; train contains trainA and trainB, and test contains testA and testB.
However, I am unable to work out how to access trainA, trainB, testA and testB so that they are accepted by the Keras CycleGAN example.

Best practice is to write your own TensorFlow dataset.
You can do so with the TFDS CLI (command line interface):
Install the TFDS CLI: pip install -q tfds-nightly
Navigate into the directory of your dataset: cd path/to/my/project/datasets/
Create a new dataset: tfds new my_dataset
[...] Manually modify my_dataset/my_dataset.py to implement your dataset.
Navigate into your new dataset: cd my_dataset/
Build your new TFDS dataset: tfds build
Within your project you then need to import your dataset
import my.project.datasets.my_dataset
and access it as you would any other tfds dataset:
ds = tfds.load('my_dataset')
The TensorFlow documentation for adding a dataset can be found here.
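For orientation, here is a minimal sketch (not part of the original answer; the class name, archive URL, image shape, label names and file pattern are placeholders) of what my_dataset/my_dataset.py could look like for a CycleGAN-style layout with trainA/trainB/testA/testB splits, assuming a recent tfds version that accepts a dict returned from _split_generators:

import tensorflow_datasets as tfds

class MyDataset(tfds.core.GeneratorBasedBuilder):
    """Sketch of a custom dataset with four CycleGAN-style splits."""
    VERSION = tfds.core.Version('1.0.0')

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                'image': tfds.features.Image(shape=(256, 256, 3)),
                'label': tfds.features.ClassLabel(names=['A', 'B']),
            }),
            supervised_keys=('image', 'label'),
        )

    def _split_generators(self, dl_manager):
        # Placeholder archive location; tfds build downloads and extracts it.
        path = dl_manager.download_and_extract('https://example.com/Images.zip')
        return {
            'trainA': self._generate_examples(path / 'train' / 'trainA', 'A'),
            'trainB': self._generate_examples(path / 'train' / 'trainB', 'B'),
            'testA': self._generate_examples(path / 'test' / 'testA', 'A'),
            'testB': self._generate_examples(path / 'test' / 'testB', 'B'),
        }

    def _generate_examples(self, path, label):
        # One example per image file in the split folder.
        for img_path in path.glob('*.jpg'):
            yield img_path.name, {'image': img_path, 'label': label}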

I can't write a comment yet, but I think this may help some others: kosa's pipeline was working for me (I did optional renamings for my use case), but I couldn't load the dataset with the current TensorFlow example for CycleGAN (https://www.tensorflow.org/tutorials/generative/cyclegan).
I used
tfds.load("Soiled")
and I got an error message that a 'label' was not found. I found a solution (TypeError: tf__normalize_img() missing 1 required positional argument: 'label') which states that you have to use
tfds.load("Soiled", as_supervised=True)
as otherwise the data is loaded as a dictionary and not as the needed tuple of (image, label).
This addition worked for me; see the sketch below.
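To illustrate the difference, a minimal sketch of a two-argument map function in the style of the tutorial's normalize_img (names are illustrative; the scaling to [-1, 1] follows the CycleGAN tutorial):

import tensorflow as tf
import tensorflow_datasets as tfds

def normalize_img(image, label):
    # With as_supervised=True each element is an (image, label) tuple,
    # so a two-argument map function like this one works.
    image = tf.cast(image, tf.float32) / 127.5 - 1.0
    return image, label

ds = tfds.load("Soiled", as_supervised=True)
train_a = ds["trainA"].map(normalize_img)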

I curated/wrote the whole code here:
https://github.com/asokraju/Soiled
and added a README file with specific how-to instructions. Hope this is helpful.
Custom Tensorflow Input Pipeline for Cycle GANs
Steps to create the dataset
Organize the data set inside a Data.zip file
trainA
trainB
testA
testB
A and B represent the two classes.
Provide the path (of the Data.zip file) in line 28 of Soiled.py, i.e.,
_DL_URLS = {"Soiled": "C:\\Users\\<user>\\Downloads\\Data_001.zip"}
cd into the Soiled folder and use the tfds build command to build the data.
The TensorFlow record files can then be found at C:\Users\<user>\tensorflow_datasets\soiled. If needed, these files can be copied elsewhere for use.
Loading the data
There are multiple ways to do it.
Import the necessary packages:
import tensorflow as tf
import tensorflow_datasets as tfds
import sys
Ensure that the path to the Soiled folder containing the code (NOT the generated data) is accessible to the code. For this I have added the path as follows:
sys.path.insert(1, 'C:\\Users\\<user>\\Downloads\\')
Then the data can be loaded using:
ds = tfds.load('Soiled')
ds
{'trainA': <PrefetchDataset shapes: {image: (None, None, 3), label: ()}, types: {image: tf.uint8, label: tf.int64}>,
'trainB': <PrefetchDataset shapes: {image: (None, None, 3), label: ()}, types: {image: tf.uint8, label: tf.int64}>,
'testA': <PrefetchDataset shapes: {image: (None, None, 3), label: ()}, types: {image: tf.uint8, label: tf.int64}>,
'testB': <PrefetchDataset shapes: {image: (None, None, 3), label: ()}, types: {image: tf.uint8, label: tf.int64}>}
test:
next(iter(ds['trainA']))
(output truncated)
{'image': <tf.Tensor: shape=(1200, 1920, 3), dtype=uint8, numpy=
array([[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[115, 173, 187],
[112, 174, 197],
[108, 172, 199]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[119, 170, 191],
[115, 165, 192],
[117, 168, 197]],
[[255, 255, 255],
[255, 255, 255],
[255, 255, 255],
...,
[109, 145, 179],
[134, 162, 199],
[134, 158, 194]],
...
...,
[ 72, 95, 67],
[ 78, 99, 66],
[ 79, 99, 62]]], dtype=uint8)>,
'label': <tf.Tensor: shape=(), dtype=int64, numpy=0>}
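Continuing from the load above, the four splits can then be pulled out of the returned dictionary and used in place of the horse2zebra splits of the Keras CycleGAN example (illustrative only; the example's preprocessing still has to be mapped onto them):

ds = tfds.load("Soiled")
train_A, train_B = ds["trainA"], ds["trainB"]
test_A, test_B = ds["testA"], ds["testB"]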
Steps used to create the folder structure.
Install tensorflow_datasets package
On the command line, type tfds new Soiled. This will create a Soiled folder with the file structure:
checksums.tsv
dummy_data/
Soiled.py
Soiled_test.py
Edit Soiled.py as needed.
Possible issues:
If it fails to build the pipeline, delete the tensorflow_datasets folder BEFORE you retry. On Windows it can be found at C:\Users\<user>.
If it gives an error similar to
# tensorflow.python.framework.errors_impl.NotFoundError: Could not find directory C:\Users\<user>\tensorflow_datasets\downloads\extracted\ZIP.Users_kkosara_Downloads_Data_18r38_Co4F-G6ka9wRk2wGFbDPqLZu8TekEV7s9L9enI.zip\testA\trainA
try changing the data_dirs in the relevant lines to path_to_dataset, or whatever ensures they point to the correct path of the downloaded data.
Ensure that the folder structure is proper
1. Organize the data set inside a `Data.zip` file
trainA
trainB
testA
testB
A and B represent the two classes.
Also ensure that there is nothing other than image files inside these folders.
Used Resources
How to load custom data into tfds for keras cyclegan example?
https://www.tensorflow.org/datasets/cli
https://www.tensorflow.org/datasets/catalog/cycle_gan
https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/generative/cyclegan.ipynb#scrollTo=Ds4o1h4WHz9U

Related

databricks/spark/python/pyspark/serializers.py AttributeError: 'str' object has no attribute 'get'

When executing the following code provided by Databricks, a serialization error appears.
The code is basically a Hyperopt optimization of XGBoost in the Databricks environment. This code is part of an end-to-end tutorial provided by Databricks.
Code:
from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK
from hyperopt.pyll import scope
from math import exp
from sklearn.metrics import roc_auc_score             # needed for the AUC metric below
from mlflow.models.signature import infer_signature   # needed for infer_signature below
import mlflow.xgboost
import numpy as np
import pyspark                                         # so the pyspark reference below resolves
import xgboost as xgb

pyspark.InheritableThread
#mlflow.set_experiment("/Shared/experiments/ichi")

# X_train, X_val, y_train, y_val come from earlier steps of the tutorial (not shown here)

search_space = {
    'max_depth': scope.int(hp.quniform('max_depth', 4, 100, 1)),
    'learning_rate': hp.loguniform('learning_rate', -3, 0),
    'reg_alpha': hp.loguniform('reg_alpha', -5, -1),
    'reg_lambda': hp.loguniform('reg_lambda', -6, -1),
    'min_child_weight': hp.loguniform('min_child_weight', -1, 3),
    'objective': 'binary:logistic',
    'seed': 123,  # Set a seed for deterministic training
}

def train_model(params):
    # With MLflow autologging, hyperparameters and the trained model are automatically logged to MLflow.
    mlflow.xgboost.autolog()
    with mlflow.start_run(nested=True):
        train = xgb.DMatrix(data=X_train, label=y_train)
        validation = xgb.DMatrix(data=X_val, label=y_val)
        # Pass in the validation set so xgb can track an evaluation metric. XGBoost terminates training
        # when the evaluation metric is no longer improving.
        booster = xgb.train(params=params, dtrain=train, num_boost_round=1000,
                            evals=[(validation, "validation")], early_stopping_rounds=50)
        validation_predictions = booster.predict(validation)
        auc_score = roc_auc_score(y_val, validation_predictions)
        mlflow.log_metric('auc', auc_score)

        signature = infer_signature(X_train, booster.predict(train))
        mlflow.xgboost.log_model(booster, "model", signature=signature)

        # Set the loss to -1*auc_score so fmin maximizes the auc_score
        return {'status': STATUS_OK, 'loss': -1*auc_score, 'booster': booster.attributes()}

# Greater parallelism will lead to speedups, but a less optimal hyperparameter sweep.
# A reasonable value for parallelism is the square root of max_evals.
spark_trials = SparkTrials(parallelism=10)

# Run fmin within an MLflow run context so that each hyperparameter configuration is logged
# as a child run of a parent run called "xgboost_models".
with mlflow.start_run(run_name='xgboost_models'):
    best_params = fmin(
        fn=train_model,
        space=search_space,
        algo=tpe.suggest,
        max_evals=96,
        trials=spark_trials,
    )
The error is:
/databricks/spark/python/pyspark/rdd.py:980: FutureWarning: Deprecated in 3.1, Use pyspark.InheritableThread with the pinned thread mode enabled.
warnings.warn(
0%| | 0/96 [00:00<?, ?trial/s, best loss=?]trial task 0 failed, exception is Caused by Traceback (most recent call last):
File "/databricks/spark/python/pyspark/serializers.py", line 165, in _read_with_length
return self.loads(obj)
File "/databricks/spark/python/pyspark/serializers.py", line 469, in loads
return pickle.loads(obj, encoding=encoding)
File "/databricks/python/lib/python3.8/site-packages/mlflow/exceptions.py", line 83, in __init__
error_code = json.get("error_code", ErrorCode.Name(INTERNAL_ERROR))
AttributeError: 'str' object has no attribute 'get'
.
None
This code is executed in a Databricks notebook. I tried different versions of mlflow, pyspark and hyperopt but without success.
SparkTrials automatically tracks its runs, and this clashes with the tracking you explicitly set up in the train_model function. Just remove all of the mlflow calls from train_model and you are good to go; a minimal sketch is shown below.
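For illustration, a sketch of train_model with the explicit MLflow calls stripped out (imports, variable names and data are as in the question's code):

def train_model(params):
    train = xgb.DMatrix(data=X_train, label=y_train)
    validation = xgb.DMatrix(data=X_val, label=y_val)
    booster = xgb.train(params=params, dtrain=train, num_boost_round=1000,
                        evals=[(validation, "validation")], early_stopping_rounds=50)
    validation_predictions = booster.predict(validation)
    auc_score = roc_auc_score(y_val, validation_predictions)
    # fmin minimizes the loss, so return the negative AUC to maximize AUC
    return {'status': STATUS_OK, 'loss': -auc_score, 'booster': booster.attributes()}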

Saving optuna study.pkl in Google Colab

I'm tuning my ML model on Google Colab but I don't know how to save that model to pkl.
import time
import optuna
study_name = "/gdrive/MyDrive/Colab Notebooks/test/params_{}".format(time.strftime("%Y%m%d-%H%M%S"))
study=optuna.create_study(study_name, direction='maximize')
The code shows me this error:
Could not parse rfc1738 URL from string '/gdrive/MyDrive/Colab Notebooks/test/params_20220217-181559'
What should I do to save this model?
You mean save the study?
https://optuna.readthedocs.io/en/stable/faq.html#how-can-i-save-and-resume-studies
I use this:
Install joblib, then:
import joblib
# Let's say I want to save study to savepath + "xgb_optuna_study_batch.pkl"
joblib.dump(study, f"{savepath}xgb_optuna_study_batch.pkl") # save study
# to load it:
jl = joblib.load(f"{savepath}xgb_optuna_study_batch.pkl")
print(jl.best_trial.params)
# output, for example:
{'lambda': 1.4556073038174557, 'alpha': 0.007250895998233471, 'colsample_bytree': 0.7, 'subsample': 0.8, 'learning_rate': 0.01, 'max_depth': 20, 'random_state': 48, 'min_child_weight': 1}
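A side note on the error in the question: in the Optuna versions around that time, the first positional argument of create_study is storage, so a plain path passed positionally is parsed as a database URL, which is what triggers the rfc1738 message. A sketch of passing the name by keyword, with an optional SQLite storage (the paths are only examples):

import time
import optuna

study_name = "params_{}".format(time.strftime("%Y%m%d-%H%M%S"))
study = optuna.create_study(
    study_name=study_name,   # keyword argument, not positional
    direction="maximize",
    # optional: let Optuna persist the study itself instead of using joblib
    storage="sqlite:////gdrive/MyDrive/optuna_studies.db",
    load_if_exists=True,
)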

HuggingFace-Transformers --- NER single sentence/sample prediction

I am trying to predict with the NER model, as in the tutorial from huggingface (it contains only the training+evaluation part).
I am following this exact tutorial here : https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb
The training works flawlessly, but my problems begin when I try to predict on a simple sample.
from transformers import AutoTokenizer, AutoModel  # imports needed for the snippet

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
loaded_model = AutoModel.from_pretrained('./my_model_own_custom_training.pth',
                                         from_tf=False)

input_sentence = "John Nash is a great mathematician, he lives in France"
tokenized_input_sentence = tokenizer([input_sentence],
                                     truncation=True,
                                     is_split_into_words=False,
                                     return_tensors='pt')
predictions = loaded_model(tokenized_input_sentence["input_ids"])[0]
predictions is of shape (1, 13, 768).
How can I arrive at the final result of the form [John <-> 'B-PER', … France <-> 'B-LOC'], where B-PER and B-LOC are two ground-truth labels, representing the tag for a person and a location respectively?
The result of the prediction is:
torch.Size([1, 13, 768])
If I write:
print(predictions.argmax(axis=2))
tensor([613, 705, 244, 620, 206, 206, 206, 620, 620, 620, 477, 693, 308])
I get the tensor above.
However, I would have expected to get a tensor representing the ground-truth labels [0…8] from the annotations.
Summary when loading the model:
loading configuration file ./my_model_own_custom_training.pth/config.json
Model config DistilBertConfig {
  "name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForTokenClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights": true,
  "transformers_version": "4.8.1",
  "vocab_size": 30522
}
The answer is a bit trickier than expected [huge credit to Niels Rogge].
Firstly, loading models in huggingface-transformers can be done in (at least) two ways:
AutoModel.from_pretrained('./my_model_own_custom_training.pth', from_tf=False)
AutoModelForTokenClassification.from_pretrained('./my_model_own_custom_training.pth', from_tf=False)
It seems that, depending on the task at hand, different AutoModel subclasses need to be used. In the scenario I posted, it is AutoModelForTokenClassification that has to be used.
After that, a solution to obtain the predictions would be to do the following:
# forward pass
outputs = model(**encoding)
logits = outputs.logits
predictions = logits.argmax(-1)
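To get from the predicted indices to tag strings, the model's id2label mapping can be combined with the tokens (a sketch; encoding, model and tokenizer are the objects from above, and the label names are whatever id2label was set to during training):

tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
labels = [model.config.id2label[int(p)] for p in predictions[0]]
for token, label in zip(tokens, labels):
    print(token, label)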

Why does torch.nn.Upsample return a junk image?

When I execute the code segment below, nn.Upsample seems to be completely destroying my image. Am I applying it in the wrong way?
import torch
import imageio
import torch.nn as nn
from matplotlib import pyplot as plt
small = imageio.imread('small.png') # shape 200, 390, 4
small_reshaped = small.reshape(4, 200, 390) # shape 4, 200, 390
batch = torch.as_tensor(small_reshaped).unsqueeze(0) # shape 1, 4, 200, 390
ups = nn.Upsample((500, 970))
upsampled_batch = ups(batch) # shape 1, 4, 500, 970
upsampled_small = upsampled_batch[0].reshape(500, 970, 4) # shape 500, 970, 4
plt.imshow(small)
plt.imshow(upsampled_small)
plt.show()
Before upsampling:
After upsampling:
Original image (small.png):
Resolved it. Reshaping destroys the image. I should have transposed instead.
See https://discuss.pytorch.org/t/for-beginners-do-not-use-view-or-reshape-to-swap-dimensions-of-tensors/75524 for more details.
A working solution:
...
small_reshaped = small.transpose(2, 0, 1) # shape 4, 200, 390
...
upsampled_small = upsampled_batch[0].transpose(0,1).transpose(1,2) # shape 500, 970, 4
...
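Putting it together, a sketch of the corrected flow (same file as in the question; the float cast is only there because some PyTorch versions refuse to upsample uint8 tensors, and the result is cast back for display):

import torch
import imageio
import torch.nn as nn
from matplotlib import pyplot as plt

small = imageio.imread('small.png')                       # H x W x C: (200, 390, 4)
small_chw = small.transpose(2, 0, 1)                      # C x H x W: (4, 200, 390), no data scrambling
batch = torch.as_tensor(small_chw).unsqueeze(0).float()   # (1, 4, 200, 390)

ups = nn.Upsample((500, 970))                             # default mode='nearest'
upsampled = ups(batch)                                    # (1, 4, 500, 970)

# move channels back to the last axis for matplotlib and restore uint8 for display
upsampled_hwc = upsampled[0].permute(1, 2, 0).to(torch.uint8).numpy()

plt.imshow(small)
plt.show()
plt.imshow(upsampled_hwc)
plt.show()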

"No such file or directory:" error when using Keras to load images

I am following this tutorial to build an image binary classifier. Following the instructions, my code below tries to load an image "test_00000.png" in order to randomly generate 20 transformations of the original image:
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest')
img = load_img('/Users/Steven/data/image_data/test_00000.png') # this is a PIL image
x = img_to_array(img) # this is a Numpy array with shape (3, 150, 150)
x = x.reshape((1,) + x.shape) # this is a Numpy array with shape (1, 3, 150, 150)
# the .flow() command below generates batches of randomly transformed images
# and saves the results to the `preview/` directory
i = 0
for batch in datagen.flow(x, batch_size=1,
                          save_to_dir='/Users/Steven/data/image_data/preview',
                          save_prefix='test', save_format='png'):
    i += 1
    if i > 20:
        break  # otherwise the generator would loop indefinitely
But I got an error:
Traceback (most recent call last):
File "tf.py", line 28, in <module>
save_to_dir='/Users/Steven/data/image_data/preview', save_prefix='test', save_format='png'):
File "C:\Users\Steven\Anaconda3\lib\site-packages\keras\preprocessing\image.py", line 1069, in __next__
return self.next(*args, **kwargs)
File "C:\Users\Steven\Anaconda3\lib\site-packages\keras\preprocessing\image.py", line 1189, in next
return self._get_batches_of_transformed_samples(index_array)
File "C:\Users\Steven\Anaconda3\lib\site-packages\keras\preprocessing\image.py", line 1171, in _get_batches_of_transformed_samples
img.save(os.path.join(self.save_to_dir, fname))
File "C:\Users\Steven\Anaconda3\lib\site-packages\PIL\Image.py", line 1932, in save
fp = builtins.open(filename, "w+b")
FileNotFoundError: [Errno 2] No such file or directory: '/Users/Steven/data/image_data/preview\\test_0_2149.png'
I have manually created a folder named "preview" in /Users/Steven/data/image_data, which is why I don't understand how this error occurs. Your help is appreciated!
I was getting the same issue, and I think the reason is that the API call has changed a little bit. Instead, one can use:
tf.keras.preprocessing.image.load_img(
    path, grayscale=False, color_mode='rgb', target_size=None,
    interpolation='nearest')
Remember to change the path :)
The correct path should be
'C:\\Users\\Steven\\data\\image_data\\preview\\test_0_2149.png'
You forgot the drive letter C:\\.
P.S.: I was having a similar problem on Windows and discovered that the full paths of some of my images had more than 260 characters, which is a limitation on Windows. My solution was to move the folder to a shallower path.
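For completeness, a sketch of the question's script with absolute Windows paths (drive letter included) for both the input image and the preview directory; adjust the paths to your own machine:

from keras.preprocessing.image import ImageDataGenerator, img_to_array, load_img

datagen = ImageDataGenerator(rotation_range=40, width_shift_range=0.2,
                             height_shift_range=0.2, rescale=1./255,
                             shear_range=0.2, zoom_range=0.2,
                             horizontal_flip=True, fill_mode='nearest')

img = load_img('C:\\Users\\Steven\\data\\image_data\\test_00000.png')
x = img_to_array(img)
x = x.reshape((1,) + x.shape)

i = 0
for batch in datagen.flow(x, batch_size=1,
                          save_to_dir='C:\\Users\\Steven\\data\\image_data\\preview',
                          save_prefix='test', save_format='png'):
    i += 1
    if i > 20:
        break  # otherwise the generator would loop indefinitely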

Resources