K-fold cross-validation on images - Keras

Let's say I have some pictures divided in 3 categories ("cat", "dog", "mouse") and my DL net is written in keras.
The design I used is the same as in this picture (1):
I split the data into three different folders: training, validation and test.
The net should be able to recognize a cat, a dog or a mouse given a picture. The accuracy I get is around 98%.
It works.
But for some reasons I need to change that design. I would like to use K-fold cross-validation, and the schema should now look like this (2):
Now my problem is that I don't know how to split and distribute the original data according to the schema in Fig. 2.
I can only imagine 2 different ways. Let's forget the test directory for the moment:
I create 2 folders: "Training" and "Validation". In both is the structure the same as in Fig. 1: Three subdirectory for every categories. Now the problem is: should I move the data around when progressing from Fold 1 to Fold 3? Or I can allocate once the images into the subdirectories?
I create 2 folders: "Training" and "Validation", BUT I mix all images togheter. No subdirectory. In this case I have the problem that I lose the connection between the picture name and the pet on it. How can I tell Keras, which animal should be identified?
Personally I would mix all images togheter, no matter what they show. But I would save the information of the content into a file. In this case I pass to Keras the directory (Validation or Training) and a file containing the name of all files and their content.
What would you suggest?
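For illustration, that mapping-file idea could look roughly like this (a sketch; the all_images/ directory, the category folders and the column names are made-up placeholders):

import os
import pandas as pd

# Walk the original per-category folders once to record each file's label;
# afterwards the images can all live together, since the table carries the labels.
rows = []
for label in ['cat', 'dog', 'mouse']:
    folder = os.path.join('all_images', label)
    for fname in os.listdir(folder):
        rows.append({'filename': os.path.join(label, fname), 'class': label})
df = pd.DataFrame(rows)
df.to_csv('labels.csv', index=False)  # the "file containing names and contents"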

OK, I can answer my own question.
The easiest way is just to use KFold from sklearn in the Python script:
from sklearn.model_selection import KFold
After that you need to instantiate KFold:
kfold = KFold(n_splits=4, shuffle=True)
and you iterate over the split dataset like this:
datagen = ImageDataGenerator(rescale=1. / 255.)

for train, test in kfold.split(df):  # df is the whole dataset (all images together!)
    df_train = df.iloc[train, :]     # "train" comes from the for loop above
    df_test = df.iloc[test, :]       # the same for "test"
    train_generator = datagen.flow_from_dataframe(dataframe=df_train,
                                                  directory=dataset_dir,
                                                  ...)
    test_generator = datagen.flow_from_dataframe(dataframe=df_test,
                                                 directory=dataset_dir,
                                                 ...)
    model = models.Sequential()
    ...
    model.compile(...)
    model.fit(...)
and it is done! The dataset is now split into partitions!
Note that the ImageDataGenerator is instantiated outside the for loop, while the model creation, compile() and fit() must be inside the for loop, so that every fold trains a fresh model.
The code above works very well for me.
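For completeness, here is how the whole loop might look end to end, with one fresh model and one evaluation per fold (a sketch under assumptions: a labels.csv with filename/class columns as sketched above, all images in one dataset_dir, and a deliberately tiny network; on older Keras versions use fit_generator()/evaluate_generator() instead of fit()/evaluate()):

import pandas as pd
from sklearn.model_selection import KFold
from keras import layers, models
from keras.preprocessing.image import ImageDataGenerator

df = pd.read_csv('labels.csv')      # columns: filename, class (hypothetical)
dataset_dir = 'all_images/'         # all images together in one place
datagen = ImageDataGenerator(rescale=1. / 255.)
kfold = KFold(n_splits=4, shuffle=True)

scores = []
for fold, (train, test) in enumerate(kfold.split(df)):
    df_train, df_test = df.iloc[train, :], df.iloc[test, :]
    train_generator = datagen.flow_from_dataframe(
        dataframe=df_train, directory=dataset_dir,
        x_col='filename', y_col='class',
        target_size=(150, 150), class_mode='categorical')
    test_generator = datagen.flow_from_dataframe(
        dataframe=df_test, directory=dataset_dir,
        x_col='filename', y_col='class',
        target_size=(150, 150), class_mode='categorical')

    # Fresh model per fold, so no fold benefits from earlier training.
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(3, activation='softmax'),  # cat, dog, mouse
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(train_generator, epochs=5)

    loss, acc = model.evaluate(test_generator)
    scores.append(acc)
    print('Fold %d accuracy: %.3f' % (fold + 1, acc))

print('Mean accuracy over all folds: %.3f' % (sum(scores) / len(scores)))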

Related

Load several images without labels in a Keras CNN

I have several .jpeg images with different names that I want to load into a CNN in a Jupyter notebook to have them classified. The only way I found was:
from keras.preprocessing import image
import numpy as np

test_image = image.load_img("name_of_picture.jpeg", target_size=(64, 64))
test_image = image.img_to_array(test_image)
test_image = np.expand_dims(test_image, axis=0)
result = cnn.predict(test_image)
Everything else I found in the Keras API, like tf.keras.preprocessing.image_dataset_from_directory(), seems to only work on labeled data. Sadly I can't "simply" iterate over the names of the pictures, as they are all named differently. Is there a way to predict all of them at once, without naming every single picture?
Thanks for your help,
Nick
The solution: tf.keras.preprocessing.image_dataset_from_directory can be updated to return both the dataset and the image paths, as explained here -> https://stackoverflow.com/a/63725072/4994352
There are multiple ways; for larger data it is useful to use a tf.data.Dataset, as it can be tweaked for performance quite easily. I will give you the non-performance-optimized code. Replace <YOUR PATH INCL. REGEX> with a path like ../input/pokemon-images-and-types/images/*/*.
import tensorflow as tf
from tensorflow.data.experimental import AUTOTUNE

def load(file_path):
    img = tf.io.read_file(file_path)
    img = tf.image.decode_jpeg(img, channels=3)
    ...  # do some preprocessing like resizing, if necessary
    return img

list_ds = tf.data.Dataset.list_files(str('<YOUR PATH INCL. REGEX>'), shuffle=True)  # get all images from subfolders
train_dataset = list_ds.take(-1)
# Set `num_parallel_calls` so multiple images are loaded/processed in parallel.
train_dataset = train_dataset.map(load, num_parallel_calls=AUTOTUNE)
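To then classify all of those unlabeled images in one go, one option (a sketch; cnn stands for the already-trained tf.keras model from the question, and load must resize every image to the model's input shape so they can be batched together) is:

# Batch the images and hand the whole dataset to predict() in one call.
train_dataset = train_dataset.batch(32).prefetch(AUTOTUNE)
predictions = cnn.predict(train_dataset)  # shape: (num_images, num_classes)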

Capture output of ImageDataGenerator without saving images to drive?

I'm using image augmentation to train my model, applying transforms like brightness and color shifts. I like to preview the effects of the augmentations before putting them into use. Normally, I do it like this:
from keras_preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(horizontal_flip=True,
                             fill_mode="nearest",
                             zoom_range=0.3,
                             rotation_range=360)
i = 0
for batch in datagen.flow_from_directory(directory='./my_images/',
                                         batch_size=1,
                                         save_to_dir='.',
                                         save_prefix='aug',
                                         save_format='jpeg'):
    i += 1
    if i >= 2:
        break  # yields two images
Then I open the images and look at them, or read them in and print them to my notebook. But I don't like this; it's clunky. Is there a way to directly capture the altered images produced by my generator? I'd like to add them to an array.
Thanks!
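One way to do exactly that (a sketch, assuming the same ./my_images/ layout as above): drop save_to_dir entirely and collect the batches the generator yields; with class_mode=None each batch is already a NumPy array of augmented images.

import numpy as np
from keras_preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(horizontal_flip=True,
                             fill_mode="nearest",
                             zoom_range=0.3,
                             rotation_range=360)

gen = datagen.flow_from_directory(directory='./my_images/',
                                  batch_size=1,
                                  class_mode=None)  # yield images only, no labels

# Each next(gen) has shape (batch_size, height, width, channels);
# stack a few augmented images into one array for previewing.
augmented = np.vstack([next(gen) for _ in range(2)])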

How to deal with imbalanced classes in Keras

I am working on a multi-label image classification problem with Keras, so I utilize the functions flow_from_dataframe() and fit_generator().
I have about 2000 classes and, as you can guess, they are highly skewed/imbalanced. After searching a bit I came across the arguments class_weight and classes, and I decided to give them a try. My problem is, I am not sure if I use them correctly. Here is an example:
Let's assume that I have flattened all class occurrences so that I get the following list of (duplicated) labels:
labels = ['classD', 'classA', 'classA', 'classC', 'classD', 'classD']
And this is the function that computes classes and class_weight:
from collections import Counter

def get_classes_weights(l, n):
    counter = Counter(l).most_common(n)
    classes = [cls for cls, ocu in counter]
    majority = max([ocu for cls, ocu in counter])
    weights = {idx: float(majority / ocu) for idx, (cls, ocu) in enumerate(counter)}
    return classes, weights
Let's also assume that I want to consider the top-2 classes only:
classes, class_weight = get_classes_weights(labels, 2)
This gives (classD is the majority with three occurrences, while classA has two, so classA is up-weighted by 3/2 = 1.5):
classes: ['classD', 'classA']
and:
class_weight: {0: 1.0, 1: 1.5}
And finally, this is how I use them within the functions:
generator_train.flow_from_dataframe(
    classes=classes,
    ...
)
model.fit_generator(
    class_weight=class_weight,
    ...
)
So my questions are:
Is the above the right way to apply weights, given that I work on a multi-label image classification problem?
Does my validation set need to be balanced, or is it OK if it is taken from the same distribution as the training set (20% and 80% random selection, respectively)?

Train your own image with TensorFlow?

I have one image (I don't have a dataset). I want to train a model in TensorFlow,
such that I can use that model to recognize the image fast.
I have implemented one such thing, but it doesn't work:
import tensorflow as tf

# step 1: define the list of files
filenames = ['pic.jpg']
# step 2: create a queue of file names
filename_queue = tf.train.string_input_producer(filenames)
# step 3: read, decode and resize images
reader = tf.WholeFileReader()
filename, content = reader.read(filename_queue)
image = tf.image.decode_jpeg(content, channels=3)
image = tf.cast(image, tf.float32)
resized_image = tf.image.resize_images(image, [224, 224])
# step 4: batching
image_batch = tf.train.batch([resized_image], batch_size=8)
Also, how is Vuforia able to recognize things so fast with only one image? I want a similar implementation in TensorFlow.
This is not how machine learning and deep learning work. You can't just grab one element and build a model that explains this one element. If you check a few NN tutorials, you will see that in order to train a reasonable model, people use thousands or even millions of data points.

How to find key trees/features from a trained random forest?

I am using the scikit-learn RandomForestClassifier and trying to extract the meaningful trees/features in order to better understand the prediction results.
I found this method, which seems relevant, in the documentation (http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.get_params), but couldn't find an example of how to use it.
I am also hoping to visualize those trees if possible, any relevant code would be great.
Thank you!
I think you're looking for Forest.feature_importances_. This allows you to see what the relative importance of each input feature is to your final model. Here's a simple example.
import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Let's set up a training dataset. We'll make 100 entries, each with 19 features,
# and each row classified as either 0 or 1. For rows classified as "1" we'll set
# the first 3 features to fixed values, so that we know these are the "important"
# features. If we do it right, the model should point out these three as important.
# The rest of the features will just be noise.
train_data = []  # must be all floats
for _ in range(100):
    line = []
    if random.random() > 0.5:
        line.append(1.0)
        # Add the 3 features that we know indicate a row classified as "1".
        line.append(.77)
        line.append(.33)
        line.append(.55)
        for _ in range(16):  # fill in the rest with noise
            line.append(random.random())
    else:
        # this is a "0" row, so fill it with noise
        line.append(0.0)
        for _ in range(19):
            line.append(random.random())
    train_data.append(line)
train_data = np.array(train_data)

# Create the random forest object which will include all the parameters for the
# fit. (Current scikit-learn always computes feature importances; the old
# compute_importances argument no longer exists.)
Forest = RandomForestClassifier(n_estimators=100)

# Fit the training data to the training output and create the decision trees.
# The first column in our data is the classification, the rest are the features.
Forest = Forest.fit(train_data[:, 1:], train_data[:, 0])

# Now you can see the importance of each feature in Forest.feature_importances_.
# These values all add up to one. Let's call the "important" ones the ones that
# are above average.
important_features = []
for x, i in enumerate(Forest.feature_importances_):
    if i > np.average(Forest.feature_importances_):
        important_features.append(str(x))
print('Most important features:', ', '.join(important_features))
# We see that the model correctly detected that the first three features are
# the most important, just as we expected!
To get the relative feature importances, read the relevant section of the documentation along with the code of the linked examples in that same section.
The trees themselves are stored in the estimators_ attribute of the random forest instance (only after the call to the fit method). Now, to extract a "key tree", you would first have to define what it is and what you expect to do with it.
You could rank the individual trees by computing their score on a held-out test set, but I don't know what you would expect to get out of that (a sketch follows below).
Do you want to prune the forest to make prediction faster, by reducing the number of trees without decreasing the aggregate forest accuracy?
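As a sketch of that ranking idea (assuming Forest was fitted as in the example above and that X_test/y_test are a held-out split; note that the toy labels there are already 0/1, matching the internal encoding the individual trees were trained on):

# Score every individual tree of the forest on held-out data.
tree_scores = sorted(
    ((i, tree.score(X_test, y_test)) for i, tree in enumerate(Forest.estimators_)),
    key=lambda pair: pair[1],
    reverse=True)
print(tree_scores[:5])  # the five single trees that do best on their own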
Here is how I visualize the tree:
First make the model after you have done all of the preprocessing, splitting, etc:
# max number of trees = 100
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
Make predictions:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
Then make the plot of importances. The variable dataset is the name of the original dataframe.
import numpy as np
import matplotlib.pyplot as plt

# get importances from the fitted random forest
importances = classifier.feature_importances_
# argsort returns indices in ascending order of importance, so the most
# important feature ends up at the top of the horizontal bar chart
indices = np.argsort(importances)
# get the feature names from the original data set
features = dataset.columns[0:26]
# plot them with a horizontal bar chart
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
This yields a horizontal bar chart of the features, ranked by relative importance.
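To visualize the trees themselves, which the question also asked about, one option (a sketch; needs scikit-learn 0.21+ and the classifier fitted above) is sklearn.tree.plot_tree on a single member of estimators_:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the first tree of the forest; capping the depth keeps the plot readable.
plt.figure(figsize=(20, 10))
plot_tree(classifier.estimators_[0], max_depth=2, filled=True)
plt.show()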
