Way to print invalid filenames for generator in keras

Way to print invalid filenames for generator in keras - keras

I have a big dataframe that I pass to a
generator.flow_from_dataframe(df,...)
but when I run it, I have
UserWarning: Found 52 invalid image filename(s) in x_col="image". These filename(s) will be ignored.
.format(n_invalid, x_col)
There is a way to print these invalid image filenames or to understand their indexes in the df?

I had a similar error, bypassed it by using the flag validate_filenames=False in the flow_from_dataframe method
Currently, AFAIK there is no way to list the file names from the data-frame that do not map to the image directory, in Keras
You can write your python method to list the differences or there must be a 3rd party library which does the same

The issue can be fixed by specifying the absolute path for images within the dataframe.
Assuming :
The dataframe df contains two columns - Image(X) & Class(Y)
Images are stored in train_dir
(Image dataframe structure)
# Specifying absolute path for images in the data frame
abs_file_names = []
for file_name in df['Image']:
tmp = os.path.abspath(train_dir+os.sep+file_name)
abs_file_names.append(tmp)
# update dataframe
df['Image'] = abs_file_names
datagen = ImageDataGenerator(rescale=1./255.,validation_split=0.25)
a = datagen.flow_from_dataframe(
dataframe = df,
train_dir=None,
x_col="Image",
y_col="Class",
weight_col=None,
target_size=(150, 150),
color_mode="rgb",
classes=None,
class_mode="categorical",
batch_size=32,
shuffle=True,
seed=None,
save_to_dir=None,
save_prefix="",
save_format=None,
subset=None,
interpolation="nearest",
validate_filenames=True,
)
Useful resources for more info :
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator#flow_from_dataframe
Keras flowFromDirectory get file names as they are being generated
https://github.com/keras-team/keras-preprocessing/blob/6701f27afa62712b34a17d4b0ff879156b0c7937/keras_preprocessing/image/dataframe_iterator.py#L267
https://github.com/keras-team/keras-preprocessing/issues/92

df['image'] = df['image']+'.jpg'
Use the above code to convert the names of the images into filenames before inputting "image" in x_col parameter in the flow_from_dataframe function. This is what worked for me.

Related

Reading a set of HDF5 files and then slicing the resulting datasets without storing them in the end

I think some of my question is answered here:1
But the difference that I have is that I'm wondering if it is possible to do the slicing step without having to re-write the datasets to another file first.
Here is the code that reads in a single HDF5 file that is given as an argument to the script:
with h5py.File(args.H5file, 'r') as df:
print('Here are the keys of the input file\n', df.keys())
#interesting point here: you need the [:] behind each of these and we didn't need it when
#creating datasets not using the 'with' formalism above. Adding that even handled the cases
#in the 'hits' and 'truth_hadrons' where there are additional dimensions...go figure.
jetdset = df['jets'][:]
haddset = df['truth_hadrons'][:]
hitdset = df['hits'][:]
Then later I do some slicing operations on these datasets.
Ideally I'd be able to pass a wild-card into args.H5file and then the whole set of files, all with the same data formats, would end up in the three datasets above.
I do not want to store or make persistent these three datasets at the end of the script as the output are plots that use the information in the slices.
Any help would be appreciated!

There are at least 2 ways to access multiple files:
If all files follow a naming pattern, you can use the glob
module. It uses wildcards to find files. (Note: I prefer
glob.iglob; it is an iterator that yields values without creating a list. glob.glob creates a list which you frequently don't need.)
Alternatively, you could input a list of filenames and loop on
the list.
Example of iglob:
import glob
for fname in glob.iglob('img_data_0?.h5'):
with h5py.File(fname, 'r') as h5f:
print('Here are the keys of the input file\n', h5.keys())
Example with a list of names:
filenames = [ 'img_data_01.h5', 'img_data_02.h5', 'img_data_03.h5' ]
for fname in filenames:
with h5py.File(fname, 'r') as h5f:
print('Here are the keys of the input file\n', h5.keys())
Next, your code mentions using [:] when you access a dataset. Whether or not you need to add indices depends on the object you want returned.
If you include [()], it returns the entire dataset as a numpy array. Note [()] is now preferred over [:]. You can use any valid slice notation, e.g., [0,0,:] for a slice of a 3-axis array.
If you don't include [:], it returns a h5py dataset object, which
behaves like a numpy array. (For example, you can get dtype and shape, and slice the data). The advantage? It has a smaller memory footprint. I use h5py dataset objects unless I specifically need an array (for example, passing image data to another package).
Examples of each method:
jets_dset = h5f['jets'] # w/out [()] returns a h5py dataset object
jets_arr = h5f['jets'][()] # with [()] returns a numpy array object
Finally, if you want to create a single array that merges values from 3 datasets, you have to create an array big enough to hold the data, then load with slice notation. Alternatively, you can use np.concatenate() (However, be careful, as concatenating a lot of data can be slow.)
A simple example is shown below. It assumes you know the shape of the dataset, and they are the same for all 3 files. (a0, a1 are the axes lengths for 1 dataset) If you don't know them, you can get them from the .shape attribute
Example for method 1 (pre-allocating array jets3x_arr):
a0, a1 = 100, 100
jets3x_arr = np.empty(shape=(a0, a1, 3)) # add dtype= if not float
for cnt, fname in enumerate(glob.iglob('img_data_0?.h5')):
with h5py.File(fname, 'r') as h5f:
jets3x_arr[:,:,cnt] = h5f['jets']
Example for method 2 (using np.concatenate()):
a0, a1 = 100, 100
for cnt, fname in enumerate(glob.iglob('img_data_0?.h5')):
with h5py.File(fname, 'r') as h5f:
if cnt == 0:
jets3x_arr= h5f['jets'][()].reshape(a0,a1,1)
else:
jets3x_arr= np.concatenate(\
(jets3x_arr, h5f['jets'][()].reshape(a0,a1,1)), axis=2)

How to inspect values in binarized FairSeq datasets?

Running the fairseq-preprocess script produces binary files with integer indices corresponding to token ids in a dictionary.
When I no longer have the original tokenized texts, what is the simplest way to explore the binarized dataset? The documentation does not say much about how a dataset can be loaded for debugging purposes.

I worked around this by loading the trained model and using it to decode the binarized sentences back to strings:
from fairseq.models.transformer import TransformerModel
model_dir = ???
data_dir = ???
model = TransformerModel.from_pretrained(
model_dir,
checkpoint_file='checkpoint_best.pt',
data_name_or_path=data_dir,
bpe='sentencepiece',
sentencepiece_model=model_dir + '/sentencepiece.joint.bpe.model'
)
model.task.load_dataset('train')
data_bin = model.task.datasets['train']
train_pairs = [
(model.decode(item['source']), model.decode(item['target']))
for item in data_bin
]

MultiOutput Classification with TensorFlow Extended (TFX)

I'm quite new to TFX (TensorFlow Extended), and have been going through the sample tutorial on the TensorFlow portal to understand a bit more to apply it to my dataset.
In my scenario, instead of predicting a single label, the problem at hand requires me to predict 2 outputs (category 1, category 2).
I've done this using pure TensorFlow Keras Functional API and that works fine, but then am now looking to see if that can be fitted into the TFX pipeline.
Where i get the error, is at the Trainer stage of the pipeline, and where it throws the error is in the _input_fn, and i suspect it's because i'm not correctly splitting out the given data into (features, labels) tensor pair in the pipeline.
Scenario:
Each row of the input data comes in the form of
[Col1, Col2, Col3, ClassificationA, ClassificationB]
ClassificationA and ClassificationB are the categorical labels which i'm trying to predict using the Keras Functional Model
The output layer of the keras functional model looks like below, where there's 2 outputs that is joined to a single dense layer (Note: _xf appended to the end is just to illustrate that i've encoded the classes to int representations)
output_1 = tf.keras.layers.Dense(
TargetA_Class, activation='sigmoid',
name = 'ClassificationA_xf')(dense)
output_2 = tf.keras.layers.Dense(
TargetB_Class, activation='sigmoid',
name = 'ClassificationB_xf')(dense)
model = tf.keras.Model(inputs = inputs,
outputs = [output_1, output_2])
In the trainer module file, i've imported the required packages at the start of the module file >
import tensorflow_transform as tft
from tfx.components.tuner.component import TunerFnResult
import tensorflow as tf
from typing import List, Text
from tfx.components.trainer.executor import TrainerFnArgs
from tfx.components.trainer.fn_args_utils import DataAccessor, FnArgs
from tfx_bsl.tfxio import dataset_options
The current input_fn in the trainer module file looks like the below (by following the tutorial)
def _input_fn(file_pattern: List[Text],
data_accessor: DataAccessor,
tf_transform_output: tft.TFTransformOutput,
batch_size: int = 200) -> tf.data.Dataset:
"""Helper function that Generates features and label dataset for tuning/training.
Args:
file_pattern: List of paths or patterns of input tfrecord files.
data_accessor: DataAccessor for converting input to RecordBatch.
tf_transform_output: A TFTransformOutput.
batch_size: representing the number of consecutive elements of returned
dataset to combine in a single batch
Returns:
A dataset that contains (features, indices) tuple where features is a
dictionary of Tensors, and indices is a single Tensor of label indices.
"""
return data_accessor.tf_dataset_factory(
file_pattern,
dataset_options.TensorFlowDatasetOptions(
batch_size=batch_size,
#label_key=[_transformed_name(x) for x in _CATEGORICAL_LABEL_KEYS]),
label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]), _transformed_name(_CATEGORICAL_LABEL_KEYS[1])),
tf_transform_output.transformed_metadata.schema)
When i run the trainer component the error that comes up is:
label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]),transformed_name(_CATEGORICAL_LABEL_KEYS1)),
^ SyntaxError: positional argument follows keyword argument
I've also tried label_key=[_transformed_name(x) for x in _CATEGORICAL_LABEL_KEYS]) which also gives an error.
However, if i just pass in a single label key, label_key=transformed_name(_CATEGORICAL_LABEL_KEYS[0]) then it works fine.
FYI - _CATEGORICAL_LABEL_KEYS is nothing but a list which contains the names of the 2 outputs i'm trying to predict (ClassificationA, ClassificationB).
transformed_name is nothing but a function to return an updated name/key for the transformed data:
def transformed_name(key):
return key + '_xf'
Question:
From what i can see, the label_key argument for dataset_options.TensorFlowDatasetOptions can only accept a single string/name of label, which means it may not be able to output the dataset with multi labels.
Is there a way which i can modify the _input_fn so that i can get the dataset that's returned by _input_fn to work with returning the 2 output labels? So the tensor that's returned looks something like:
Feature_Tensor: {Col1_xf: Col1_transformedfeature_values, Col2_xf:
Col2_transformedfeature_values, Col3_xf:
Col3_transformedfeature_values}
Label_Tensor: {ClassificationA_xf: ClassA_encodedlabels,
ClassificationB_xf: ClassB_encodedlabels}
Would appreciate advice from the wider community of tfx!

Since the label key is optional, maybe instead of specifying it in the TensorflowDatasetOptions, instead you can use dataset.map afterwards and pass both labels after taking them from your dataset.
Haven't tested it but something like:
def _data_augmentation(feature_dict):
features = feature_dict[_transformed_name(x) for x in
_CATEGORICAL_FEATURE_KEYS]]
keys=[_transformed_name(x) for x in _CATEGORICAL_LABEL_KEYS]
return features, keys
def _input_fn(file_pattern: List[Text],
data_accessor: DataAccessor,
tf_transform_output: tft.TFTransformOutput,
batch_size: int = 200) -> tf.data.Dataset:
"""Helper function that Generates features and label dataset for tuning/training.
Args:
file_pattern: List of paths or patterns of input tfrecord files.
data_accessor: DataAccessor for converting input to RecordBatch.
tf_transform_output: A TFTransformOutput.
batch_size: representing the number of consecutive elements of returned
dataset to combine in a single batch
Returns:
A dataset that contains (features, indices) tuple where features is a
dictionary of Tensors, and indices is a single Tensor of label indices.
"""
dataset = data_accessor.tf_dataset_factory(
file_pattern,
dataset_options.TensorFlowDatasetOptions(
batch_size=batch_size,
tf_transform_output.transformed_metadata.schema)
dataset = dataset.map(_data_augmentation)
return dataset

Extracting specific data from a [pandas.core.frame.DataFrame] variable

While extracting data from a .csv file using pandas, I wanted to collect the labels of various columns in that file. Instead of hardcoding, I was trying to extract it from the variable I created from the code below:
train_data = pd.read_csv("Anydatasheet.csv")
features = ["Pclass","Age", "Fare", "Parch", "SibSp","Sex","Embarked"]
X = pd.get_dummies(train_data[features])
X.head()
(By labels above, I mean the bold text circled in the image attached)
Can anyone tell me how to do it?
(Image data source : Kaggle titanic problem data)
enter image description here

What you are looking for is the columns names. You can read them directly:
train_data = pd.read_csv("Anydatasheet.csv")
features_name = train_data.columns
or if you want them as python regular list:
train_data = pd.read_csv("Anydatasheet.csv")
features_name = train_data.columns.tolist()
example:
import pandas as pd
df = pd.DataFrame({"city":[1,1], "A":[5,6]})
print(df.columns.tolist())
output:
['city', 'A']

How to bulk test the Sagemaker Object detection model with a .mat dataset or S3 folder of images?

I have trained the following Sagemaker model: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/introduction_to_amazon_algorithms/object_detection_pascalvoc_coco
I've tried both the JSON and RecordIO version. In both, the algorithm is tested on ONE sample image. However, I have a dataset of 2000 pictures, which I would like to test. I have saved the 2000 jpg pictures in a folder within an S3 bucket and I also have two .mat files (pics + ground truth). How can I apply this model to all 2000 pictures at once and then save the results, rather than doing it one picture at a time?
I am using the code below to load a single picture from my S3 bucket:
object = bucket.Object('pictures/pic1.jpg')
object.download_file('pic1.jpg')
img=mpimg.imread('pic1.jpg')
img_name = 'pic1.jpg'
imgplot = plt.imshow(img)
plt.show(imgplot)
with open(img_name, 'rb') as image:
f = image.read()
b = bytearray(f)
ne = open('n.txt','wb')
ne.write(b)
import json
object_detector.content_type = 'image/jpeg'
results = object_detector.predict(b)
detections = json.loads(results)
print (detections['prediction'])

I'm not sure if I understood your question correctly. However, if you want to feed multiple images to the model at once, you can create a multi-dimensional array of images (byte arrays) to feed the model.
The code would look something like this.
import numpy as np
...
# predict_images_list is a Python list of byte arrays
predict_images = np.stack(predict_images_list)
with graph.as_default():
# results is an list of typical results you'd get.
results = object_detector.predict(predict_images)
But, I'm not sure if it's a good idea to feed 2000 images at once. Better to batch them in 20-30 images at a time and predict.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Way to print invalid filenames for generator in keras - keras

df['image'] = df['image']+'.jpg' Use the above code to convert the names of the images into filenames before inputting "image" in x_col parameter in the flow_from_dataframe function. This is what worked for me.

Related

Reading a set of HDF5 files and then slicing the resulting datasets without storing them in the end

How to inspect values in binarized FairSeq datasets?

MultiOutput Classification with TensorFlow Extended (TFX)

Extracting specific data from a [pandas.core.frame.DataFrame] variable

How to bulk test the Sagemaker Object detection model with a .mat dataset or S3 folder of images?

Categories

Resources