Grouping arrays with common classes for classification in CNN - python-3.x

I have a data set with three columns: the first two columns are the features and the third column contains the classes. There are 4 classes; part of the data set can be seen here.
The data set is big, say 100,000 rows and 3 columns (two feature columns and one class column), so I apply a moving window of length 50 to it before training my deep learning model. So far I have tried two different methods of slicing the data set, with no good results, and I am pretty sure the data set itself is good.
I first used a moving window on my entire data set, resulting in 2000 data samples with 50 rows and 2 columns, i.e. shape (2000, 50, 2). As some data samples contain mixed classes, I selected only the data samples whose rows share a common class and took the average of the class column to assign each sample to a single class. I have not gotten good results with this. Here is my code:`
import itertools as it
import numpy as np

def moving_window(data_, length, step=1):
    streams = it.tee(data_, length)
    return zip(*[it.islice(stream, i, None, step * length) for stream, i in zip(streams, it.count(step=step))])

data = list(moving_window(data_, 50))
data = np.asarray(data)
# print(len(data))
X, Y = [], []
for i in data:
    # True in column 2 only if every row of the window has the same class
    label = np.all(i == i[0, 2], axis=0)
    if label[2]:
        X.append(i[:, 0:2])
        Y.append(sum(i[:, 2]) / len(i[:, 2]))`
I tried another way: I collected only the features corresponding to a particular class, put the values into separate lists (4 lists, as I have 4 classes), then used a moving window to slice each list separately and assigned each slice to its class. No good results either. Here is my code:
for i in range(5):
    labels.append(i)
yy = pd.get_dummies(labels)
yy = yy.values
yy = yy.astype(np.float32)
def moving_window(x, length, step=1):
    streams = it.tee(x, length)
    return zip(*[it.islice(stream, i, None, step * length) for stream, i in zip(streams, it.count(step=step))])
x_1 = list(moving_window(x1, 50))
x_1 = np.asarray(x_1)
y_1 = [yy[0]] * len(x_1)
X.append(x_1)
Y.append(y_1)
# print(x_1.shape)
x_2 = list(moving_window(x2, 50))
x_2 = np.asarray(x_2)
# print(yy[1])
y_2 = [yy[1]] * len(x_2)
X.append(x_2)
Y.append(y_2)
# print(x_2.shape)
x_3 = list(moving_window(x3, 50))
x_3 = np.asarray(x_3)
# print(yy[2])
y_3 = [yy[2]] * len(x_3)
X.append(x_3)
Y.append(y_3)
# print(x_3.shape)
x_4 = list(moving_window(x4, 50))
x_4 = np.asarray(x_4)
# print(yy[3])
y_4 = [yy[3]] * len(x_4)
X.append(x_4)
Y.append(y_4)
# print(x_4.shape)
The architecture of the model I am trying to train works perfectly with other data sets, so I think I am missing something in how I process the data. What am I missing in my preprocessing before I start training? Is there any other way? All the work is done in Python.
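The first approach described above (keep only windows whose rows share one class) can be sketched without averaging the class column, by reading the label straight from the window. This is a minimal illustration, assuming each row is a `(feature_1, feature_2, class)` tuple and that windows do not overlap; the name `pure_windows` and the toy data are hypothetical and not part of the original code.

```python
def pure_windows(rows, length):
    """Split rows into non-overlapping windows and keep single-class ones.

    Each row is assumed to be (feature_1, feature_2, class_label).
    Returns (X, y): X is a list of windows of [feature_1, feature_2] pairs,
    y holds one class label per kept window.
    """
    X, y = [], []
    # Step through the data one full window at a time; a trailing partial
    # window is dropped.
    for start in range(0, len(rows) - length + 1, length):
        window = rows[start:start + length]
        classes = {row[2] for row in window}
        if len(classes) == 1:  # every row in the window has the same class
            X.append([list(row[:2]) for row in window])
            y.append(window[0][2])
    return X, y


# Toy data: six rows of class 0 followed by six rows of class 1.
rows = [(i * 0.1, i * 0.2, 0) for i in range(6)] + [(i * 0.1, i * 0.2, 1) for i in range(6)]
X, y = pure_windows(rows, length=4)
```

With a window length of 4, the middle window straddles the class boundary and is discarded, so two windows remain. Because the label is taken from the window itself, it is always a valid class value rather than a floating-point average.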

I finally managed to train my CNN model and achieved good training, validation and testing accuracy. The only thing I added was normalization of my input data with the following lines:
from sklearn import preprocessing
minmax_scale = preprocessing.MinMaxScaler().fit(x)
X = minmax_scale.transform(x)
The rest remains the same.

Related

keras data generator for multi task learning with non image data format

I am working on a multi-task semantic segmentation problem with three decoders, so I need to feed three inputs and produce three outputs. Furthermore, my datasets are not in image formats (.jpg, ...) but in .mat and .npy formats. My labels take the three values 0, 1, 2 (maps with the same shape as my grayscale images). With these two points in mind, I am trying to load the dataset using Keras generators, as my dataset is very large. Below is what I have tried based on the Keras documentation for generators, but to my knowledge the documentation assumes image data and a single-task network. How can I adjust my code so that I can use generators for multiple tasks and multiple (non-image) data formats?
def batch_generator(X_gen, Y_gen, map1_gen, map2_gen):
    while True:
        yield (next(X_gen), next(Y_gen), next(map1_gen), next(map2_gen))
where map1_gen and map2_gen are supposed to be generators for the other two inputs (maps).
train_images_dir = ''
train_masks_dir = ''
train_map1_dir = ''
train_map2_dir = ''
val_images_dir = ''
val_masks_dir = ''
val_map1_dir = ''
val_map2_dir = ''
datagen = ImageDataGenerator()
train_images_generator = datagen.flow_from_directory(train_images_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
train_mask_generator = datagen.flow_from_directory(train_masks_dir,target_size=(Img_Length,Img_Height, num_classes),batch_size=1,class_mode='categorical')
train_map1_generator = datagen.flow_from_directory(train_map1_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
train_map2_generator = datagen.flow_from_directory(train_map2_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size ,class_mode=None)
# val augmentation.
val_images_generator = datagen.flow_from_directory(val_images_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
val_masks_generator = datagen.flow_from_directory(val_masks_dir,target_size=(Img_Length,Img_Height, num_classes),batch_size=1,class_mode='categorical')
val_map1_generator = datagen.flow_from_directory(val_map1_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
val_map2_generator = datagen.flow_from_directory(val_map2_dir,target_size=(Img_Length,Img_Height),batch_size=batch_size,class_mode=None)
model = ...
model.fit_generator(batch_generator(train_images_generator,train_mask_generator, train_map1_generator, train_map2_generator), validation_data=batch_generator(val_images_generator,val_masks_generator, val_map1_generator, val_map2_generator),callbacks=...)
The output of the first decoder is supposed to be an (Img_Length, Img_Height) segmentation map with the three labels 0, 1, 2; the map1 and map2 outputs are each of size (Img_Length, Img_Height) and hold linear values.
You could try to implement a custom generator and drop the ImageDataGenerator completely, e.g.:
def batch_generator(batchsize):
    while True:
        inputs1 = []
        inputs2 = []
        inputs3 = []
        outputs1 = []
        outputs2 = []
        outputs3 = []
        for _ in range(batchsize):
            input1 = cv2.imread(img1)  # or whatever
            inputs1.append(input1)
            inputs2.append(...)
            ...
        # you may have to convert the lists into numpy arrays
        yield ([inputs1, inputs2, inputs3], [outputs1, outputs2, outputs3])
Basically, you directly yield a list of all your inputs and a list of all your outputs, each element being a batch.
That means you have to read the files in manually, but I think that makes sense considering you have some non-image data types.
You can then pass this generator to model.fit_generator (or just model.fit in TensorFlow 2):
model.fit_generator(batch_generator(batchsize))
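The batching logic of such a generator can be sketched independently of Keras. The example below is an illustration under stated assumptions: `samples` is a hypothetical in-memory list of `(inputs, outputs)` pairs with three entries each, and strings stand in for arrays that would really be loaded from .npy or .mat files.

```python
import itertools


def batch_generator(samples, batch_size):
    """Yield ([inputs1, inputs2, inputs3], [outputs1, outputs2, outputs3]) forever.

    `samples` is assumed to be a list of (inputs, outputs) pairs, where inputs
    and outputs are each 3-tuples (one entry per network input / decoder).
    """
    stream = itertools.cycle(samples)  # loop over the data indefinitely
    while True:
        inputs = ([], [], [])
        outputs = ([], [], [])
        for _ in range(batch_size):
            sample_in, sample_out = next(stream)
            for bucket, value in zip(inputs, sample_in):
                bucket.append(value)
            for bucket, value in zip(outputs, sample_out):
                bucket.append(value)
        # For a real model, each list would be converted to an array
        # (for example with np.asarray) before being yielded.
        yield list(inputs), list(outputs)


# Toy samples: strings stand in for loaded arrays.
samples = [((f"img{i}", f"amp{i}", f"phase{i}"), (f"mask{i}", f"map1_{i}", f"map2_{i}"))
           for i in range(5)]
gen = batch_generator(samples, batch_size=2)
first_inputs, first_outputs = next(gen)
```

Each yielded element groups one batch per input and one batch per output, in the same order as the model's inputs and outputs. Because the stream cycles, a batch may wrap around the end of the data when the sample count is not a multiple of the batch size.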

Best practices for high performance input pipeline using only tf.data API (no feed_dict)

The official TensorFlow Performance Guide states the following:
While feeding data using a feed_dict offers a high level of
flexibility, in general feed_dict does not provide a scalable
solution. If only a single GPU is used, the difference between the
tf.data API and feed_dict performance may be negligible. Our
recommendation is to avoid using feed_dict for all but trivial
examples. In particular, avoid using feed_dict with large inputs.
However, avoiding the use of feed_dict entirely appears to be impossible. Consider the following setup with train, validation, and test datasets.
ds = tf.data.Dataset
n_files = 1000 # total number of tfrecord files
split = int(.67 * n_files)
files = ds.zip((ds.range(n_files),ds.list_files("train/part-r-*")))
train_files = files.filter(lambda a, b: a < split).map(lambda a,b: b)
validation_files = files.filter(lambda a, b: a >= split).map(lambda a,b: b)
test_files = ds.list_files("test/part-r-*")
A common method to parse the datasets might look like the following:
def setup_dataset(self, file_ds, mode="train"):
    data = file_ds.apply(tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=4,
        sloppy=True,
        buffer_output_elements=self.batch_size * 8,
        prefetch_input_elements=self.batch_size * 8
    ))
    if mode == "train":
        data = data.map(self.train_data_parser)
    else:
        data = data.map(self.test_data_parser)
    return data
Then instead of feeding the individual features through a feed_dict in session.run(), you would create a reusable iterator with either Iterator.from_structure() or Iterator.from_string_handle(). I will show an example with the former, but you run into the same problem either way.
train = self.setup_dataset(train_files)
self.ops["template_iterator"] = tf.data.Iterator.from_structure(train.output_types, train.output_shapes)
self.ops["next_batch"] = self.ops["template_iterator"].get_next(name="next_batch")
self.ops["train_init"] = self.ops["template_iterator"].make_initializer(train)
validation = self.setup_dataset(validation_files)
self.ops["validation_init"] = self.ops["template_iterator"].make_initializer(validation)
This all works great, but what am I supposed to do with the test dataset? The test dataset will not contain the label feature(s) and therefore not conform to the same output_types and output_shapes as the train and validation datasets.
I would ideally like to restore from a SavedModel and initialize the test dataset rather than serve the model over an API.
What is the trick that I am missing to incorporate test dataset during inference?
I have my datasets and iterators set up for training and inference like this:
# Train dataset
images_train = tf.placeholder(tf.float32, train_images.shape)
labels_train = tf.placeholder(tf.float32, train_masks.shape)
dataset_train = tf.data.Dataset.from_tensor_slices({"images": images_train, "masks": labels_train})
dataset_train = dataset_train.batch(MINIBATCH)
dataset_train = dataset_train.map(lambda x: map_helper(x, augmentation), num_parallel_calls=8)
dataset_train = dataset_train.shuffle(buffer_size=10000)
iterator_train = tf.data.Iterator.from_structure(dataset_train.output_types, dataset_train.output_shapes)
training_init_op = iterator_train.make_initializer(dataset_train)
batch_train = iterator_train.get_next()
# Inference dataset
images_infer = tf.placeholder(tf.float32, shape=[None] + list(valid_images.shape[1:]))
labels_infer = tf.placeholder(tf.float32, shape=[None] + list(valid_masks.shape[1:]))
dataset_infer = tf.data.Dataset.from_tensor_slices({"images": images_infer, "masks": labels_infer})
dataset_infer = dataset_infer.batch(MINIBATCH)
iterator_infer = tf.data.Iterator.from_structure(dataset_infer.output_types, dataset_infer.output_shapes)
infer_init_op = iterator_infer.make_initializer(dataset_infer)
batch_infer = iterator_infer.get_next()
Training
Initialise the iterator for training using training_init_op
sess.run(training_init_op, feed_dict={images_train: train_images, labels_train: train_masks})
Validation
Initialise the inference iterator for validation using infer_init_op
sess.run(infer_init_op, feed_dict={images_infer: images_val, labels_infer: masks_val})
Test
Initialise the inference iterator for testing using infer_init_op. This is a bit hacky, but I create an array with zeros where the labels would go and use the same iterator I used for validation
sess.run(infer_init_op, feed_dict={images_infer: images_test, labels_infer: np.zeros(images_test.shape)})
Alternatively, you could create three separate datasets and iterators for train, validation and test.

Changing width of heatmap in Seaborn to compensate for font size reduction

I have a sentence, say:
Hey I am feeling pretty boring today and the day is dull too
I pass it through the OpenAI sentiment code, which gives me some neuron weights; their number can be equal to or a little greater than the number of words.
Neuron weights are
[ 0.01258736, 0.03544582, 0.08490616, 0.09010842, 0.07180552,
0.07271874, 0.08906463, 0.09690772, 0.10281454, 0.08131664,
0.08315734, 0.0790544 , 0.07770097, 0.07302617, 0.07329235,
0.06856266, 0.07642639, 0.08199468, 0.09079508, 0.09539193,
0.09061056, 0.07109602, 0.02138061, 0.02364372, 0.00322057,
0.01517018, 0.01150052, 0.00627739, 0.00445003, 0.00061127,
0.0228037 , -0.29226044, -0.40493113, -0.4069235 , -0.39796737,
-0.39871565, -0.39242673, -0.3537892 , -0.3779315 , -0.36448184,
-0.36063945, -0.3506464 , -0.36719123, -0.37997353, -0.35103855,
-0.34472692, -0.36256564, -0.35900915, -0.3619383 , -0.3532831 ,
-0.35352525, -0.33328298, -0.32929575, -0.33149993, -0.32934144,
-0.3261477 , -0.32421976, -0.3032671 , -0.47205922, -0.46902984,
-0.45346943, -0.4518705 , -0.50997925, -0.50997925]
Now what I want to do is plot a heatmap in which positive values show positive sentiment and negative values show negative sentiment. I am plotting the heatmap, but it isn't plotting like it should.
When the sentence gets longer, the whole text gets smaller and smaller until it can't be read. What changes should I make so that it displays better?
Here is my plotting function:
def plot_neuron_heatmap(text, values, savename=None, negate=False, cell_height=.112, cell_width=.92):
    #n_limit = 832
    cell_height=.325
    cell_width=.15
    n_limit = count
    num_chars = len(text)
    text = list(map(lambda x: x.replace('\n', '\\n'), text))
    num_chars = len(text)
    total_chars = math.ceil(num_chars/float(n_limit))*n_limit
    mask = np.array([0]*num_chars + [1]*(total_chars-num_chars))
    text = np.array(text+[' ']*(total_chars-num_chars))
    values = np.array((values+[0])*(total_chars-num_chars))
    values = values.reshape(-1, n_limit)
    text = text.reshape(-1, n_limit)
    mask = mask.reshape(-1, n_limit)
    num_rows = len(values)
    plt.figure(figsize=(cell_width*n_limit, cell_height*num_rows))
    hmap=sns.heatmap(values, annot=text, mask=mask, fmt='', vmin=-5, vmax=5, cmap='RdYlGn',xticklabels=False, yticklabels=False, cbar=False)
    plt.subplots_adjust()
    #plt.tight_layout()
    plt.savefig('fig1.png')
    #plt.show()
This is how it shows the lengthy text:
What I want it to show:
Here is a link to the full notebook: https://github.com/yashkumaratri/testrepo/blob/master/heatmap.ipynb
Mad Physicist, your code does this:
and what it really should do is:
The shrinkage of the font you are seeing is to be expected. As you add more characters horizontally, the font shrinks to fit everything in. There are a couple of solutions for this. The simplest would be to break your text into smaller chunks and display them as you show in your desired output. You can also save your figure with a different DPI than what is shown on the screen, so that the letters look fine in the image file.
You should consider cleaning up your function along the way:
count appears to be a global that is never used.
You redefine variables without ever using the original value (e.g. num_chars and the input parameters).
You have a whole bunch of variables you don't really use.
You recompute a lot of quantities multiple times.
The expression list(map(lambda x: x.replace('\n', '\\n'), text)) is total overkill: list(text.replace('\n', '\\n')) does the same thing.
Given that len(values) != len(text) for most cases, the line values = np.array((values+[0])*(total_chars-num_chars)) is nonsense and needs cleanup.
You are constructing numpy arrays by doing padding operations on lists, instead of using the power of numpy.
You have the entire infrastructure for properly reshaping your arrays already in place, but you don't use it.
The updated version below fixes the minor issues and adds n_limit as a parameter, which determines how many characters you are willing to have in a row of the heat map. As I mentioned in the last item, you already have all the necessary code to reshape your arrays properly, and even mask out the extra tail you end up with sometimes. The only thing that is wrong is the -1 in the shape, which always resolves to one row because of the remainder of the shape. Additionally, the figure is always saved at 100dpi, so the results should come out consistent for a given width, no matter how many rows you end up with. The DPI affects PNG because it increases or decreases the total number of pixels in the image, and PNG does not actually understand DPI:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def plot_neuron_heatmap(text, values, n_limit=80, savename='fig1.png',
                        cell_height=0.325, cell_width=0.15, dpi=100):
    text = text.replace('\n', '\\n')
    # Pad with spaces so the length is a multiple of n_limit
    text = np.array(list(text + ' ' * (-len(text) % n_limit)))
    if len(values) > text.size:
        values = np.array(values[:text.size])
    else:
        t = values
        # float dtype so the fractional weights are not truncated to integers
        values = np.zeros(text.shape, dtype=float)
        values[:len(t)] = t
    text = text.reshape(-1, n_limit)
    values = values.reshape(-1, n_limit)
    # mask = np.zeros(values.shape, dtype=bool)
    # mask.ravel()[values.size:] = True
    plt.figure(figsize=(cell_width * n_limit, cell_height * len(text)))
    hmap = sns.heatmap(values, annot=text, fmt='', vmin=-5, vmax=5, cmap='RdYlGn', xticklabels=False, yticklabels=False, cbar=False)
    plt.subplots_adjust()
    plt.savefig(savename if savename else 'fig1.png', dpi=dpi)
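The padding expression `-len(text) % n_limit` is the step that makes the reshape into rows of `n_limit` characters possible. The following is a small plain-Python sketch of just that step; the helper name `pad_and_chunk` is hypothetical and only serves to show the arithmetic without NumPy or plotting.

```python
def pad_and_chunk(text, n_limit):
    """Pad text with spaces to a multiple of n_limit, then split into rows."""
    # -len(text) % n_limit is the number of characters missing from the last
    # row; it is 0 when the text already fills its rows exactly.
    padding = -len(text) % n_limit
    padded = text + " " * padding
    return [padded[i:i + n_limit] for i in range(0, len(padded), n_limit)]


text = "Hey I am feeling pretty boring today and the day is dull too"
rows_20 = pad_and_chunk(text, 20)  # the 60-character sentence fills 3 rows
rows_7 = pad_and_chunk(text, 7)    # 3 spaces of padding complete the 9th row
```

Every row has exactly `n_limit` characters, which is the condition `reshape(-1, n_limit)` needs in order to succeed.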
Here are a couple of sample runs of the function:
text = 'Hey I am feeling pretty boring today and the day is dull too'
values = [...] # The stuff in your question
plot_neuron_heatmap(text, values)
plot_neuron_heatmap(text, values, 20)
plot_neuron_heatmap(text, values, 7)
results in the following three figures:

how do i check if a data set is normal or not in python?

I'm creating a master program for machine learning from scratch in Python, and the first step I want to do is check whether the data set is normal or not.
P.S.: the data set can have many features or just a single feature.
It has to be implemented in Python 3.
Also, normalizing the data can be done with the functions below, right?
# Find the min and max values for each column
def dataset_minmax(dataset):
    minmax = list()
    for i in range(len(dataset[0])):
        col_values = [row[i] for row in dataset]
        value_min = min(col_values)
        value_max = max(col_values)
        minmax.append([value_min, value_max])
    return minmax
# Rescale dataset columns to the range 0-1
def normalize_dataset(dataset, minmax):
    for row in dataset:
        for i in range(len(row)):
            row[i] = (row[i] - minmax[i][0]) / (minmax[i][1] - minmax[i][0])
THANKS IN ADVANCE!
Your question seems inconsistent: if your features do not come from a normal distribution, you cannot "normalize" them in the sense of changing their distribution. If you mean to check whether they have a mean of 0 and an SD of 1, that is a different ball game.
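The second reading (checking whether each feature has mean 0 and SD 1) can be sketched with the standard library alone. This is a minimal illustration, assuming the data set is a list of numeric rows; the function names and the tolerance `tol` are hypothetical choices, and the check says nothing about whether the data follows a normal distribution.

```python
import statistics


def column_stats(dataset):
    """Return (mean, population SD) for each column of a list of rows."""
    columns = list(zip(*dataset))  # transpose rows into columns
    return [(statistics.mean(col), statistics.pstdev(col)) for col in columns]


def is_standardized(dataset, tol=1e-6):
    """True if every column has mean close to 0 and population SD close to 1."""
    return all(abs(mean) <= tol and abs(sd - 1.0) <= tol
               for mean, sd in column_stats(dataset))


def standardize(dataset):
    """Return a z-scored copy: (value - column mean) / column SD."""
    stats = column_stats(dataset)
    return [[(value - mean) / sd for value, (mean, sd) in zip(row, stats)]
            for row in dataset]


data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
scaled = standardize(data)
```

Here `is_standardized(data)` is false and `is_standardized(scaled)` is true. A constant column has an SD of 0 and would need special handling before dividing.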

How to merge NaiveBayesClassifier objects in NLTK

I am working on a project using the NLTK toolkit. With the hardware I have, I am only able to train the classifier on a small data set. So I divided the data into smaller chunks and trained a classifier object on each of them, storing all these individual objects in a pickle file.
Now, for testing, I need to have the whole thing as one object to get better results. So my question is: how can I combine these objects into one?
objs = []
while True:
    try:
        f = open(picklename,"rb")
        objs.extend(pickle.load(f))
        f.close()
    except EOFError:
        break
Doing this does not work. And it gives the error TypeError: 'NaiveBayesClassifier' object is not iterable.
NaiveBayesClassifier code :
classifier = nltk.NaiveBayesClassifier.train(training_set)
I am not sure about the exact format of your data, but you cannot simply merge different classifiers. The Naive Bayes classifier stores a probability distribution based on the data it was trained on, and you cannot merge probability distributions without access to the original data.
If you look at the source code here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html
an instance of the classifier stores:
self._label_probdist = label_probdist
self._feature_probdist = feature_probdist
These are calculated in the train method using relative frequency counts, e.g. P(L_1) = (# of L_1 in training set) / (# of labels in training set). To combine two classifiers, you would want (# of L_1 in Train 1 + Train 2) / (# of labels in Train 1 + Train 2).
However, the naive bayes procedure isn't too hard to implement from scratch, especially if you follow the 'train' source code in the link above. Here is an outline, using the NaiveBayes source code
Store 'FreqDist' objects for each subset of the data for the labels and features.
label_freqdist = FreqDist()
feature_freqdist = defaultdict(FreqDist)
feature_values = defaultdict(set)
fnames = set()
# Count up how many times each feature value occurred, given
# the label and featurename.
for featureset, label in labeled_featuresets:
    label_freqdist[label] += 1
    for fname, fval in featureset.items():
        # Increment freq(fval|label, fname)
        feature_freqdist[label, fname][fval] += 1
        # Record that fname can take the value fval.
        feature_values[fname].add(fval)
        # Keep a list of all feature names.
        fnames.add(fname)
# If a feature didn't have a value given for an instance, then
# we assume that it gets the implicit value 'None.' This loop
# counts up the number of 'missing' feature values for each
# (label,fname) pair, and increments the count of the fval
# 'None' by that amount.
for label in label_freqdist:
    num_samples = label_freqdist[label]
    for fname in fnames:
        count = feature_freqdist[label, fname].N()
        # Only add a None key when necessary, i.e. if there are
        # any samples with feature 'fname' missing.
        if num_samples - count > 0:
            feature_freqdist[label, fname][None] += num_samples - count
            feature_values[fname].add(None)
# Use pickle to store label_freqdist, feature_freqdist,feature_values
Combine those using their built-in 'add' method. This will allow you to get the relative frequency across all the data.
all_label_freqdist = FreqDist()
all_feature_freqdist = defaultdict(FreqDist)
all_feature_values = defaultdict(set)
for file in train_labels:
    f = open(file,"rb")
    all_label_freqdist += pickle.load(f)
    f.close()
# Combine the default dicts for features similarly
Use the 'estimator' to create a probability distribution.
# Use the class itself as the estimator; it is called with a FreqDist below
estimator = ELEProbDist
label_probdist = estimator(all_label_freqdist)
# Create the P(fval|label, fname) distribution
feature_probdist = {}
for ((label, fname), freqdist) in all_feature_freqdist.items():
    probdist = estimator(freqdist, bins=len(all_feature_values[fname]))
    feature_probdist[label, fname] = probdist
classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
The classifier will now combine the counts across all the data and produce what you need.
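The reason counts have to be summed before normalizing, rather than merging already-computed probabilities, can be seen in a small sketch. It uses `collections.Counter` in place of NLTK's `FreqDist` and plain relative frequencies in place of the ELE estimator; the label names and counts are made up for illustration.

```python
from collections import Counter


def label_probabilities(label_counts):
    """Turn raw label counts into relative frequencies."""
    total = sum(label_counts.values())
    return {label: count / total for label, count in label_counts.items()}


# Hypothetical label counts from two separately processed chunks.
chunk_1 = Counter({"pos": 30, "neg": 10})
chunk_2 = Counter({"pos": 10, "neg": 50})

# Adding Counters sums the counts per label, which matches
# (# of L_1 in Train 1 + Train 2) / (# of labels in Train 1 + Train 2).
combined = chunk_1 + chunk_2
probs = label_probabilities(combined)

# Averaging the per-chunk probabilities instead gives a different answer
# whenever the chunks differ in size.
p1 = label_probabilities(chunk_1)
p2 = label_probabilities(chunk_2)
naive_average = {label: (p1[label] + p2[label]) / 2 for label in combined}
```

In this toy case the summed counts give P(pos) = 40/100, while averaging the two per-chunk probabilities gives a noticeably larger value, because the smaller chunk is weighted as heavily as the larger one.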
