I am trying to write a small program that classifies certain pictures into categories. I create a list of pictures in the main code and pass them to the function below in a loop. The code works fine, except that it never frees memory: each iteration uses more until the program crashes.
I already tried calling gc.collect() in the function to force it to clear memory, but that does not help. Shouldn't the memory be released automatically after each file is processed, or did I miss something here?
def classify_pictures(self, files):
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

    # Read the image_data
    image_data = tf.gfile.FastGFile(files, 'rb').read()

    # Loads label file, strips off carriage return
    label_lines = [line.rstrip() for line
                   in tf.gfile.GFile("tf_files/retrained_labels.txt")]

    # Unpersists graph from file
    with tf.gfile.FastGFile("tf_files/retrained_graph.pb", 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        _ = tf.import_graph_def(graph_def, name='')

    with tf.Session() as sess:
        # Feed the image_data as input to the graph and get first prediction
        softmax_tensor = sess.graph.get_tensor_by_name('final_result:0')
        predictions = sess.run(softmax_tensor,
                               {'DecodeJpeg/contents:0': image_data})

        # Sort to show labels of first prediction in order of confidence
        top_k = predictions[0].argsort()[-len(predictions[0]):][::-1]

        for each_picture in range(0, 10):
            human_string = label_lines[top_k[0]]
            if human_string == "selfie":
                return "selfie"
            if "passport" in human_string:
                return "passport"
            if "statement" in human_string:
                return "bill"
If you call this function in a loop, the computational graph is rebuilt on every iteration, and every previous graph is of course still there as well. That is what uses up all the memory.
To solve this, only do the session.run() call inside the loop.
In general, when writing TensorFlow code you should keep the code that builds the graph separate from the code that executes it. In your case you do both inside a single function, which is then called multiple times, so a new graph is built every time.
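A rough sketch of that split, reusing the question's files and TF 1.x calls (the setup_graph method name and the single long-lived session are assumptions, not part of the original code):

import os
import tensorflow as tf

def setup_graph(self):
    # Build the graph and open the session exactly once.
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
    self.label_lines = [line.rstrip() for line
                        in tf.gfile.GFile("tf_files/retrained_labels.txt")]
    with tf.gfile.FastGFile("tf_files/retrained_graph.pb", 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name='')
    self.sess = tf.Session()
    self.softmax_tensor = self.sess.graph.get_tensor_by_name('final_result:0')

def classify_pictures(self, files):
    # Only execute the already-built graph here; no graph construction per call.
    image_data = tf.gfile.FastGFile(files, 'rb').read()
    predictions = self.sess.run(self.softmax_tensor,
                                {'DecodeJpeg/contents:0': image_data})
    top_k = predictions[0].argsort()[::-1]
    human_string = self.label_lines[top_k[0]]
    if human_string == "selfie":
        return "selfie"
    if "passport" in human_string:
        return "passport"
    if "statement" in human_string:
        return "bill"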
My dataset code looks like the one below; here, X_test is a list[list] and y_test is a list[Path].
The first.py file
self.test_dataset = LongDataset(
    X_path=X_test,
    y_path=y_test,
    transform=val_transforms,
)
The rest is as usual (the dataloader):
def test_dataloader(self):
    return DataLoader(self.test_dataset, batch_size=1, num_workers=8)
In the second.py file
The DataModule
data_module = DataModuleLong(batch_size=3,)
The Trainer
trainer = Trainer(gpus=1)
trainer.test(
    model=model,
    ckpt_path=ckpt_path,
    datamodule=data_module,
)
The test_step() in the third.py file
def test_step(self, batch, batch_idx: int):
    inputs, targets = batch
    logits = self(inputs)
    ...
Now, is it possible to print (in test_step()) the filenames (or full paths) of the (inputs, targets) that I am sending from test_dataset as (X_path, y_path)?
Essentially, what you want to do is get the index of each batch element in the batch returned by the DataLoader object; from there it is trivial to index the dataset and get the desired data elements (in this case, file paths).
Now the short answer is that there is no directly implemented way to return this data using the dataloader. However, there are a few workarounds:
Pass your own BatchSampler or Sampler object to the DataLoader constructor. Unfortunately, there is no simple way to query the Sampler for the current batch, because it relies on generators (yielding the next sample consumes it and loads the next one). This is the same reason you can't directly access the batch indices of the DataLoader. So to use this method, you would have to pass a sampler for which you know a priori which indices will be returned on the i-th query. Not an ideal solution.
Create a custom dataset object. This is actually very easy to do: simply inherit from torch.utils.data.Dataset and implement the __init__, __len__ and __getitem__ methods. The __getitem__ method takes an index (say idx) as input and returns the element of the dataset at that index. You can essentially copy the existing LongDataset code line for line and simply append idx to the values returned from __getitem__. I would demonstrate in full, but you don't indicate where the LongDataset code comes from.
def __getitem__(self, idx):
    ...  # load files, preprocess, etc.
    return data, idx
The DataLoader will then automatically collate the idx values with each batch, so you can simply replace the existing line with:
inputs, targets, indices = batch
data_paths = [self.test_dataset.file_paths[idx] for idx in indices]
The second solution is by far preferable, as it is simpler and more transparent; a minimal sketch is given below.
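For illustration only, since the LongDataset code isn't shown: a thin wrapper dataset gets the same effect as copying the code line for line. The names IndexedDataset and the y_path attribute below are assumptions.

from torch.utils.data import Dataset

class IndexedDataset(Dataset):
    """Wraps any dataset and additionally returns the sample index."""

    def __init__(self, base_dataset):
        self.base = base_dataset

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        data, target = self.base[idx]  # whatever the wrapped dataset returns
        return data, target, idx

Then in the DataModule and the test step (assuming LongDataset keeps its y_path list around):

self.test_dataset = IndexedDataset(
    LongDataset(X_path=X_test, y_path=y_test, transform=val_transforms)
)

def test_step(self, batch, batch_idx: int):
    inputs, targets, indices = batch
    # map batch indices back to the original paths
    paths = [self.test_dataset.base.y_path[i] for i in indices.tolist()]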
When I try to save the hyperopt Trials object, which contains information about the automatic parameter tuning of a neural network,
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,  # or rand.suggest for random params selection
            max_evals=max_trials,
            trials=trials)  # , rstate=np.random.RandomState(50)
pickle.dump(trials, open("neuro.hyperopt", "wb"))
it gives the error:
can't pickle _thread.RLock objects
Moreover, it writes a 10 GB file to my local drive. That is, it saves not only the trials object but the whole model as well.
Could you help me save the trials object with a smaller size (e.g. the XGBoost trials file is about 1 MB) and avoid the error?
Thank you.
In my case it was because the models stored in the trials were not pickle-able.
I tried to save a tf.keras.optimizers.Adam(learning_rate=0.001) object.
When I stored the string 'Adam' instead, the error disappeared.
Of course, this creates another problem: how to set the learning rate for the optimizer. But that seems easier to solve. One way is to replace the Keras objects with strings in the trials.trials object before saving:
for trial in trials.trials:
    if 'result' in trial.keys():
        # drop the un-picklable model from each result
        # https://stackoverflow.com/questions/15411107/delete-a-dictionary-item-if-the-key-exists
        trial['result'].pop('model', None)

# proceed with pickling
pickle.dump(trials, open("trials.pkl", "wb"))
(I took it from here)
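As a sketch of that idea, keep only strings and numbers in the search space and in the returned result, and build the Keras optimizer inside the objective; build_model, x_train and y_train below are placeholders, not code from the question:

from hyperopt import hp, fmin, tpe, Trials, STATUS_OK
import pickle
import tensorflow as tf

space = {
    'optimizer_name': hp.choice('optimizer_name', ['Adam', 'SGD']),
    'learning_rate': hp.loguniform('learning_rate', -7, -2),
}

def objective(params):
    # Build the (un-picklable) optimizer here and never put it into the result dict.
    if params['optimizer_name'] == 'Adam':
        optimizer = tf.keras.optimizers.Adam(learning_rate=params['learning_rate'])
    else:
        optimizer = tf.keras.optimizers.SGD(learning_rate=params['learning_rate'])
    model = build_model()                           # placeholder for your model
    model.compile(optimizer=optimizer, loss='mse')
    history = model.fit(x_train, y_train, epochs=5, verbose=0)
    # Return plain Python values only, so the Trials object stays small and picklable.
    return {'loss': history.history['loss'][-1], 'status': STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
pickle.dump(trials, open("neuro.hyperopt", "wb"))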
I have a script that uses the MTCNN face detection library and iterates through a fair number of directories, totaling thousands of images. The issue I've been running into with this script is its excessive memory usage, which eventually causes my MacBook (16 GB of RAM) to run out of memory. What I'm looking to do is implement batching on a folder-by-folder basis, instead of a fixed batch size, because no single folder contains enough images to exhaust memory on its own.
# open up the csv file
with open(csv_path, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Index', 'Threshhold', 'Path'])

for path, subdirs, files in os.walk(path):
    for name in files:
        if name == '.DS_Store':
            print("Skipping .DS_Store")
            continue
        else:
            try:
                image = os.path.join(path, name)
                pixels = pyplot.imread(image)
                print("Processing " + image)
                print("Count: " + str(inc))

                # calculate the area of the image
                total_height = pixels.shape[0]
                total_width = pixels.shape[1]
                total_area = total_height * total_width

                # create the detector, using default weights
                detector = MTCNN()
                faces = detector.detect_faces(pixels)
                ax = pyplot.gca()
                face_total_area = 0

                if faces == []:
                    print("No faces detected.")
                    # pass in 0 for the threshold because there are no faces
                    #write_to_csv(inc, 0, image)
                    print()
                else:
                    for face in faces:
                        # get dimensions from the face
                        x, y, width, height = face['box']
                        # calculate the area of the face
                        face_area = width * height
                        face_total_area += face_area

                    threshold = face_total_area / total_area

                    # write to csv only if the threshold is less than the limit
                    # change back to this eventually ^^^^^^^^^
                    if threshold > threshhold_limit:
                        print("Facial area is over the threshold - writing file path to csv.")
                        write_to_csv(inc, threshold, image)
                    else:
                        print("Image threshold is under the limit - good")
                        print(threshold)
                        print()

                inc += 1
            except:
                print("Processing error - skipping image")
Is something like this possible to do, or should it be done a different way? The idea is that batching like this will allow MTCNN to release the memory it is holding onto once it is done processing a folder.
Memory usage should not increase with this program, because it does not accumulate data from one image to the next, so what you are asking for will have no effect. Have you tried running this same code outside of a Python notebook, as a standalone program? It may be that the notebook is keeping references to all of the images it has read.
Either that, or find a call that really resets pyplot's internal state inside the innermost loop (maybe pyplot.clf()).
"Batching" as you say is what takes place inside the first for loop, which will run once for each folder in your tree. The only bennefit you could possibly have would be to reset the internal state inside the first loop, but outside the second for (for name in ...), you'd have to find the exactly same call to reset the internal state.
(Also, on a side note: the csv writer you create in the with block is invalidated at the end of that block. You should refactor this code so that it does not keep reopening the CSV file for each new line, which is what happens in the write_to_csv function that is not shown.)
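And a sketch of keeping one open writer for the whole walk instead of reopening the file per row; compute_threshold is a placeholder for the detection logic, since write_to_csv is not shown:

import csv
import os

with open(csv_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Index', 'Threshhold', 'Path'])
    inc = 0
    for path, subdirs, files in os.walk(root_path):
        for name in files:
            image = os.path.join(path, name)
            threshold = compute_threshold(image)      # placeholder for the MTCNN part
            writer.writerow([inc, threshold, image])  # write through the open writer
            inc += 1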
I'm trying to read a directory of image files into TensorFlow and I'm having a little trouble. When I run the script, the shell just hangs (I even waited 10 minutes for output), and the only output I get is from the print(len(os.listdir())) line.
My attempts stem from this guide:
https://www.tensorflow.org/api_guides/python/reading_data
import tensorflow as tf
import os

os.chdir(r'C:\Users\Moondra\Desktop\Testing')
print(len(os.listdir()))  # only 2 in this case

file_names = tf.constant(os.listdir(), tf.string)
file_tensors = tf.train.string_input_producer(string_tensor=file_names)

reader = tf.WholeFileReader()
key, value = reader.read(file_tensors)

##features = tf.parse_single_example(value)
#records = reader.num_records_produced()

with tf.Session() as sess:
    values = sess.run(value)
    ##print(records_num)
    print(type(values))
The reader is supposed to read images one at a time, so I'm assuming value will hold the image data of the current image. Despite this, the shell just hangs and produces no output.
Thank you, and just in case, pinging @mrry if available.
tf.train.string_input_producer adds a QueueRunner to the current graph, and you need to start it manually. Otherwise the program just hangs and no output is produced.
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    values = sess.run(value)
    # ....
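For completeness, the usual pattern also stops and joins those threads when you are done (same APIs, just the full shape of the block):

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        values = sess.run(value)
        print(type(values))
    finally:
        coord.request_stop()   # ask the queue-runner threads to stop
        coord.join(threads)    # and wait for them to finish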
I am working on a project using the NLTK toolkit. With the hardware I have, I can only run the classifier on a small data set, so I divided the data into smaller chunks, ran the classifier on each of them, and stored each resulting object in a pickle file.
Now, for testing, I need the whole thing as a single object to get a better result. So my question is: how can I combine these objects into one?
objs = []
while True:
    try:
        f = open(picklename, "rb")
        objs.extend(pickle.load(f))
        f.close()
    except EOFError:
        break
Doing this does not work, and it gives the error TypeError: 'NaiveBayesClassifier' object is not iterable.
NaiveBayesClassifier code :
classifier = nltk.NaiveBayesClassifier.train(training_set)
I am not sure about the exact format of your data, but you cannot simply merge different classifiers. A Naive Bayes classifier stores a probability distribution based on the data it was trained on, and you cannot merge probability distributions without access to the original data.
If you look at the source code here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html
an instance of the classifier stores:
self._label_probdist = label_probdist
self._feature_probdist = feature_probdist
These are calculated in the train method using relative-frequency counts, e.g. P(L_1) = (# of L_1 in the training set) / (# of labels in the training set). To combine two training sets, you would want (# of L_1 in Train 1 + Train 2) / (# of labels in T1 + T2).
However, the Naive Bayes procedure isn't too hard to implement from scratch, especially if you follow the train source code linked above. Here is an outline, using the NaiveBayes source code:
Store 'FreqDist' objects for each subset of the data for the labels and features.
label_freqdist = FreqDist()
feature_freqdist = defaultdict(FreqDist)
feature_values = defaultdict(set)
fnames = set()

# Count up how many times each feature value occurred, given
# the label and featurename.
for featureset, label in labeled_featuresets:
    label_freqdist[label] += 1
    for fname, fval in featureset.items():
        # Increment freq(fval|label, fname)
        feature_freqdist[label, fname][fval] += 1
        # Record that fname can take the value fval.
        feature_values[fname].add(fval)
        # Keep a list of all feature names.
        fnames.add(fname)

# If a feature didn't have a value given for an instance, then
# we assume that it gets the implicit value 'None.' This loop
# counts up the number of 'missing' feature values for each
# (label,fname) pair, and increments the count of the fval
# 'None' by that amount.
for label in label_freqdist:
    num_samples = label_freqdist[label]
    for fname in fnames:
        count = feature_freqdist[label, fname].N()
        # Only add a None key when necessary, i.e. if there are
        # any samples with feature 'fname' missing.
        if num_samples - count > 0:
            feature_freqdist[label, fname][None] += num_samples - count
            feature_values[fname].add(None)

# Use pickle to store label_freqdist, feature_freqdist, feature_values
Combine those using their built-in 'add' method. This will allow you to get the relative frequency across all the data.
all_label_freqdist = FreqDist()
all_feature_freqdist = defaultdict(FreqDist)
all_feature_values = defaultdict(set)

for file in train_labels:
    f = open(file, "rb")
    all_label_freqdist += pickle.load(f)
    f.close()

# Combine the default dicts for features similarly
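Expanding that last comment, and assuming each pickle file stores the tuple (label_freqdist, feature_freqdist, feature_values) from step 1 (that file layout is an assumption, not something given in the question):

import pickle
from collections import defaultdict
from nltk.probability import FreqDist

all_label_freqdist = FreqDist()
all_feature_freqdist = defaultdict(FreqDist)
all_feature_values = defaultdict(set)

for filename in train_labels:  # one pickle file per data chunk
    with open(filename, "rb") as f:
        label_freqdist, feature_freqdist, feature_values = pickle.load(f)
    all_label_freqdist += label_freqdist
    for key, freqdist in feature_freqdist.items():  # key is (label, fname)
        all_feature_freqdist[key] += freqdist       # FreqDists add their counts
    for fname, values in feature_values.items():
        all_feature_values[fname] |= values         # union of the observed values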
Use the 'estimator' to create a probability distribution.
estimator = ELEProbDist  # pass the class itself, as NaiveBayesClassifier.train does
label_probdist = estimator(all_label_freqdist)

# Create the P(fval|label, fname) distribution
feature_probdist = {}
for ((label, fname), freqdist) in all_feature_freqdist.items():
    probdist = estimator(freqdist, bins=len(all_feature_values[fname]))
    feature_probdist[label, fname] = probdist

classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
The classifier will now reflect the combined counts across all the data and produce what you need.
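The combined classifier can then be used like any other NLTK classifier; the feature set below is just a made-up example:

print(classifier.classify({'contains(money)': True}))  # predict a label
classifier.show_most_informative_features(10)          # inspect the top features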