Memory leak using flow_from_directory - memory-leaks

I'm trying to implement the technique described on the documentation page
https://keras.io/preprocessing/image/
under the heading "Example of transforming images and masks together".
After the following,
image_generator = image_datagen.flow_from_directory(
'data/images',
class_mode=None,
seed=seed)
mask_generator = mask_datagen.flow_from_directory(
'data/masks',
class_mode=None,
seed=seed)
the problem arises with the command:
# combine generators into one which yields image and masks
train_generator = zip(image_generator, mask_generator)
This results in memory usage rising to the maximum possible,
and then swapping also rises to the max, at which point my
system freezes and requires rebooting.
Does anyone have a clue as to what's going on here?

SOLUTION: I was using Python 2, where zip is eager: it tries to consume both iterators completely and build a list of pairs before returning. These generators never stop yielding, so zip never returns and the list grows until all memory is exhausted. In Python 3, zip is lazy, so this wouldn't be a problem.
The solution, if you're using Python 2, is to use itertools.izip instead of zip.
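A minimal sketch of the fix, using two hypothetical stand-in generators (image_batches and mask_batches are not part of Keras; they only mimic generators that never stop yielding). The try/except picks itertools.izip on Python 2 and the built-in zip on Python 3, so the pairing stays lazy either way:

```python
import itertools


def image_batches():
    # Hypothetical stand-in for image_generator: yields forever.
    for i in itertools.count():
        yield "image_batch_%d" % i


def mask_batches():
    # Hypothetical stand-in for mask_generator: yields forever.
    for i in itertools.count():
        yield "mask_batch_%d" % i


try:
    lazy_zip = itertools.izip  # Python 2: lazy version of zip
except AttributeError:
    lazy_zip = zip  # Python 3: the built-in zip is already lazy

# Pairs are produced one at a time, so nothing is materialized up front.
train_generator = lazy_zip(image_batches(), mask_batches())

# Take only the first three pairs from the infinite stream.
first_three = list(itertools.islice(train_generator, 3))
print(first_three)
```

Only the pairs that are actually requested are ever created, which is what keeps memory flat.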

Does fitting a Weibull distribution to data using scipy.stats perform poorly?

I am working on fitting Weibull distribution on some integer data and estimating relevant shape, scale, location parameters. However, I noticed poor performance of scipy.stats library while doing so.
So, I took a different direction and checked the fit performance by using the code below. I first create 100 numbers using Weibull distribution with parameters shape=3, scale=200, location=1. Subsequently, I estimate the best distribution fit using fitter library.
from fitter import Fitter
import numpy as np
from scipy.stats import weibull_min
# generate numbers
x = weibull_min.rvs(3, scale=200, loc=1, size=100)
# make them integers
data = np.asarray(x, dtype=int)
# fit one of the four distributions
f = Fitter(data, distributions=["gamma", "rayleigh", "uniform", "weibull_min"])
f.fit()
f.summary()
I expect the best fit to be Weibull distribution. I have tried re-running this test. Sometimes Weibull fit is a good estimate. However, most of the time Weibull fit is reported as the worst result. In this case, the estimated parameters are = (0.13836651040093312, 66.99999999999999, 1.3200752378443505). I assume these parameters correspond to shape, scale, location in order. Below is the summary of the fit procedure.
$ f.summary()
             sumsquare_error          aic          bic  kl_div
gamma               0.001601  1182.739756 -1090.410631     inf
rayleigh            0.001819  1154.204133 -1082.276256     inf
uniform             0.002241  1113.815217 -1061.400668     inf
weibull_min         0.004992  1558.203041  -976.698452     inf
Additionally, the following plot is produced.
Also, Rayleigh distribution is a special case of Weibull with shape parameter = 2. So, I expect the resulting Weibull fit to be at least as good as Rayleigh.
Update
I ran the tests above on Linux/Ubuntu 20.04 machine with numpy version 1.19.2 and scipy version 1.5.2. The code above seems to run as expected and return proper results for Weibull distribution on a Mac machine.
I have also tested fitting a Weibull distribution on data x generated above on the Linux machine by using an R library fitdistrplus as:
fit.weib <- fitdist(x, "weibull")
and observed that the estimated shape and scale values are found to be very close to the initially given values. The best guess so far is that the problem is due to some Python-Ubuntu bug/incompatibility.
I can be considered as a newbie in this area. So, I am wondering, am I doing something wrong here? Or is this result somehow expected? Any help is greatly appreciated.
Thank you.
The fitter library doesn't let you specify fit parameters for the distributions, such as a, loc, etc. Strangely, for the same versions of NumPy and SciPy, the Mac produces a better fit while Linux gives a much worse result for the best fit. Underlying reasons may include different BLAS/LAPACK implementations on Linux and Mac (https://stackoverflow.com/a/49274049/6806531), weibull_min not initializing parameter a = 1 (which is discussed online), or default floating-point accuracy. However, the problem can be worked around inside the fitter library. Since weibull_min is exponweib with parameter a fixed at 1, change the run function inside the _timed_run function in fitter.py to
def run(self):
    try:
        if distribution == "exponweib":
            self.result = func(args, floc=0, fa=1, **kwargs)
        else:
            self.result = func(args, floc=0, **kwargs)
    except Exception as err:
        self.exc_info = sys.exc_info()
and using exponweib in place of weibull_min gives nearly the same results as R's fitdist.
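The relationship this workaround relies on can be checked numerically. The sketch below only compares the two densities in scipy.stats on an arbitrary grid (the grid and the shape value c = 3 are illustrative choices); it does not reproduce the fitter patch itself:

```python
import numpy as np
from scipy.stats import exponweib, weibull_min

# Arbitrary evaluation grid and shape value, chosen only for illustration.
x = np.linspace(0.1, 5.0, 50)
c = 3.0

# exponweib takes two shape parameters (a, c); weibull_min takes only c.
pdf_exponweib = exponweib.pdf(x, 1.0, c)  # a fixed at 1
pdf_weibull = weibull_min.pdf(x, c)

# With a = 1 the two densities should agree up to floating-point error.
same = np.allclose(pdf_exponweib, pdf_weibull)
print(same)
```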
I am not familiar with the Fitter library, but in order to draw some conclusions I would suggest:
Retry your code, but by taking size=10,000. In this case, there are sufficient datapoints for the fitting methods to utilize. Theoretically, you would then expect the Weibull to deliver the best fit.
I noticed that the location parameter can sometimes be a pain. You could try to run your fits with the location parameter fixed via floc=1 (i.e. equal to the location you sampled with). What do you get? Additionally, FYI, with MLE a common shortcut is to take loc=min(x), where x is your dataset. For the exponential distribution, this is in fact the MLE of the location parameter. For other distributions I am not sure, but I wouldn't be surprised if it holds for some of them as well (note that for a Weibull with shape > 1 the density is zero at x = loc, so loc would have to sit slightly below min(x)). This would reduce the fitting procedure by one parameter.
Lastly, I noticed that if you take small values for location/scale/shape for some distributions, the functions logpdf and logcdf of scipy.stats distributions result in np.inf values. In this scenario, you could perhaps use the Powell optimization algorithm and set bounds on the values of your parameters.
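The first two suggestions can be tried directly with scipy.stats, without fitter. This is a sketch under assumptions: the seed (random_state=0) is an arbitrary choice, and no particular output is claimed. Note that weibull_min.fit returns its parameters in the order (shape, loc, scale), not (shape, scale, loc) as assumed in the question, so the tuple reported there reads as shape ≈ 0.138, loc ≈ 67, scale ≈ 1.32.

```python
import numpy as np
from scipy.stats import weibull_min

# Larger sample, same generating parameters as in the question.
# random_state=0 is an arbitrary seed used only for reproducibility.
x = weibull_min.rvs(3, loc=1, scale=200, size=10000, random_state=0)

# Fix the location at the known value so only shape and scale are estimated.
shape, loc, scale = weibull_min.fit(x, floc=1)

print("shape=%.3f loc=%.3f scale=%.3f" % (shape, loc, scale))
```

If the estimates land near shape 3 and scale 200 here but not with the free-location fit, that points at the location parameter rather than at the data.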

Unusual order of dimensions of an image matrix in python

I downloaded a dataset which contains a MATLAB file called 'depths.mat' which contains a 3-dimensional matrix with the dimensions 480 x 640 x 1449. These are actually 1449 images, each with the dimension 640 x 480. I successfully loaded it into python using the scipy library but the problem is the unusual order of the dimensions. This makes Python think that there are 480 images with the dimensions 640 x 1449. I tried to reshape the matrix in python, but a simple reshape operation did not solve my problem.
Any suggestions are welcome. Thank you.
You misunderstood. You do not want to reshape, you want to transpose it. In MATLAB, arrays are A(x,y,z) while in python they are P[z,y,x]. Make sure that once you load the entire matrix, you change the first and last dimensions.
You can do this with the swapaxes function, but beware: it neither makes a copy nor changes the data. It returns a view that only changes how the array's indices map onto the underlying memory. If you have enough RAM, your best option is to make a copy and discard the original.
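A small sketch of the axis reordering, using a tiny stand-in array instead of the real 480 x 640 x 1449 data (the sizes and variable names here are illustrative). It shows swapaxes as suggested above, np.moveaxis as an alternative that keeps each image's row/column order, and a contiguous copy:

```python
import numpy as np

# Tiny stand-in for the loaded array: height x width x n_images.
height, width, n_images = 4, 6, 10
depths = np.arange(height * width * n_images, dtype=np.float32)
depths = depths.reshape(height, width, n_images)

# Swap first and last axes: (n_images, width, height). Each image is transposed.
images_swapped = np.swapaxes(depths, 0, 2)

# Move the image index to the front: (n_images, height, width). Still a view.
images = np.moveaxis(depths, -1, 0)

# Make an independent, C-contiguous copy so the original can be discarded.
images_copy = np.ascontiguousarray(images)

print(images_swapped.shape, images.shape)
```

With either view, images[k] gives the k-th image; the copy is what lets you delete the original array and also makes per-image access contiguous in memory.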

Can kernels other than periodic be used in SGPR in gpflow

I am pretty new to GPR. I will appreciate it if you provide me some suggestion regarding the following questions:
Can we use the Matern52 kernel in a sparse Gaussian process?
What is the best way to select pseudo inputs (Z) ? Is random sampling reasonable?
I would like to mention that when I am using the Matern52 kernel, the following error stops optimization process. My code:
k1 = gpflow.kernels.Matern52(input_dim=X_train.shape[1], ARD=True)
m = gpflow.models.SGPR(X_train, Y_train, kern=k1, Z=X_train[:50, :].copy())
InvalidArgumentError (see above for traceback): Input matrix is not invertible.
[[Node: gradients_25/SGPR-31ceaea6-412/Cholesky_grad/MatrixTriangularSolve = MatrixTriangularSolve[T=DT_DOUBLE, adjoint=false, lower=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](SGPR-31ceaea6-412/Cholesky, SGPR-31ceaea6-412/eye_1/MatrixDiag)]
Any help will be appreciated, thank you.
Have you tried it out on a small test set of data that you could perhaps post here? There is no reason Matern52 shouldn't work. Randomly sampling inducing points should be a reasonable initialisation, especially in higher dimensions. However, you may run into issues if you end up with some inducing points very close to each other (this can make the K_{zz} = cov(f(Z), f(Z)) matrix badly conditioned, which would explain why the Cholesky fails). If your X_train isn't already shuffled, you may want to use Z=X_train[np.random.permutation(len(X_train))[:50]] to get shuffled indices. It may also help to add a white noise kernel, kern=k1+gpflow.kernels.White() ...
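The conditioning argument can be illustrated without GPflow. The sketch below is NumPy only: matern52 is a hand-written helper (not the GPflow kernel), X_train is random stand-in data, and the jitter value 1e-6 is an arbitrary choice that plays the role of a small white-noise term on the diagonal.

```python
import numpy as np


def matern52(A, B, lengthscale=1.0, variance=1.0):
    # Hand-written Matern 5/2 kernel (illustrative helper, not GPflow's).
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    r = np.sqrt(np.maximum(sq_dist, 0.0)) / lengthscale
    s = np.sqrt(5.0) * r
    return variance * (1.0 + s + s ** 2 / 3.0) * np.exp(-s)


rng = np.random.RandomState(0)
X_train = rng.randn(200, 3)  # hypothetical training inputs

# Inducing points chosen through a random permutation of the row indices.
Z = X_train[rng.permutation(len(X_train))[:50]]

# Worst case for illustration: two inducing points that coincide exactly.
Z_dup = Z.copy()
Z_dup[1] = Z_dup[0]

K_dup = matern52(Z_dup, Z_dup)
cond_dup = np.linalg.cond(K_dup)  # very large: two rows are identical

# A small diagonal term restores a usable condition number.
jitter = 1e-6
K_jit = K_dup + jitter * np.eye(len(Z_dup))
cond_jit = np.linalg.cond(K_jit)
L = np.linalg.cholesky(K_jit)  # succeeds once the diagonal term is added

print(cond_dup, cond_jit)
```

Inducing points that are merely close, rather than identical, degrade the condition number in the same way, only less drastically.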

Is there a standard way to load/process (audio) data dynamically in tensorflow?

I'm building a network using the Nsynth dataset. It has some 22 Gb of data. Right now I'm loading everything into RAM but this presents some (obvious) problems.
This is an audio dataset and I want to window the signals and produce more examples by changing the hop size, for example. But because I don't have infinite amounts of RAM, there are very few things I can do before I run out of it (I'm actually only working with a very small subset of the dataset, don't tell google how I live).
Here's some code I'm using right now:
Code:
import os
import time

import librosa
import numpy as np

# trim_silence, windower and StrechableNumpyArray are my own helpers (not shown here).


def generate_audio_input(audio_signal, window_size):
    audio_without_silence_at_beginning_and_end = trim_silence(audio_signal, frame_length=window_size)
    splited_audio = windower(audio_without_silence_at_beginning_and_end, window_size, hop_size=2048)
    return splited_audio


start = time.time()
audios = StrechableNumpyArray()
window_size = 5120
pathToDatasetFolder = 'audio/test'
time_per_loaded = []
time_to_HD = []
for file_name in os.listdir(pathToDatasetFolder):
    if file_name.endswith('.wav'):
        now = time.time()
        # sr=None keeps the file's native sampling rate
        audio, sr = librosa.load(pathToDatasetFolder + '/' + file_name, sr=None)
        time_to_HD.append(time.time()-now)
        output = generate_audio_input(audio, window_size)
        audios.append(np.reshape(output, (-1)))
        time_per_loaded.append(time.time()-now)
audios = audios.finalize()
# one row per window
audios = np.reshape(audios, (-1, window_size))
np.random.shuffle(audios)
end = time.time()-start
print("wow, that took", end, "seconds... might want to change that to mins :)")
print("On average, it took", np.average(time_per_loaded), "per loaded file")
print("With a standard deviation of", np.std(time_per_loaded))
I'm thinking I could load only the filenames, shuffle those and then yield X loaded results for a more dynamical approach, but in that case I will still have all the different windows for a sound inside those X loaded results, giving me not a very good randomization.
I've also looked into TFRecords but I don't think that would improve anything from what I propose in the last paragraph.
So, to the clear question: Is there a standard way to load/process (audio) data dynamically in tensorflow?
I would appreciate it if the response is tailored to the particular problem I'm addressing of pre-processing my dataset before starting training.
I would also accept it if the answer is pre-process the data and save it into a TFRecord and then load the TFRecord, but I think that's sort of an overkill.
After discussing this with some colleagues over the last few months, I now think that the standard is indeed to use TFRecords. After making a few and understanding how to work with them, I found several advantages and some drawbacks when using them with audio.
Advantages:
They completely solve all enqueuing issues with minimal strain on RAM.
There are solutions to load examples randomly. How many examples you load into RAM will depend on how frequently you want to go to the HD and how much information you want to load each time you access it.
They are easy to share and the pre-processing is (usually) already incorporated. You can have several processes using them, or several people across different continents, with the certainty that you are all using the same data. This is not true when working with raw audio and processing it on the fly, as different software may apply computations differently (e.g. stft implementations may change soon).
Drawbacks:
They are too static. If you want to change your dataset in any way you need to create a new one. There is no way to modify every or any example. E.g., after a few iterations I decided to discard tensors with low amplitude. I could handle that in the code after loading a batch, but the only sensible way would be to discard the whole batch every time I found an outlier.
Creating them is a cumbersome and slow process. There is no way to start working with a TFRecord until it's complete. Additionally, if you decide to change the size of the tensors or the data type, you're going to have to make extra changes to your code and test them as some errors (e.g. data types) just pass silently.
Large on HD. Because TFRecords hold examples that are fed directly into your network, they are not equivalent to the raw audio files and you cannot erase those. And because some of the examples in the TFRecord are the product of data-augmentation techniques, they tend to be larger than the original files. (This last one is probably just a normal consequence of working with big datasets.)
All in all, I think even though they are not tailored for audio and they are not very easy to implement at first, they are quite convenient and useful. Which is probably the reason why most people that work with big datasets and whom I've asked this question said they use them.
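The "load examples randomly" point, and the concern in the question about all windows of one sound ending up together, are usually handled with a shuffle buffer; tf.data.Dataset.shuffle(buffer_size) is built on the same idea. The sketch below is plain Python rather than TensorFlow API: window_stream and its (file, window) tuples are hypothetical stand-ins for windows cut from audio files, and buffer_size is the knob that trades RAM for randomness.

```python
import random


def window_stream(n_files, windows_per_file):
    # Hypothetical stand-in: yields windows file by file, in order.
    for file_index in range(n_files):
        for window_index in range(windows_per_file):
            yield (file_index, window_index)


def shuffle_buffer(examples, buffer_size, seed=None):
    # Keep at most buffer_size examples in memory and emit a random one
    # each time a new example arrives, so windows from different files mix.
    rng = random.Random(seed)
    buffer = []
    for example in examples:
        buffer.append(example)
        if len(buffer) >= buffer_size:
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Flush whatever is left once the input is exhausted.
    rng.shuffle(buffer)
    for example in buffer:
        yield example


ordered = list(window_stream(5, 4))
mixed = list(shuffle_buffer(window_stream(5, 4), buffer_size=8, seed=0))
print(mixed[:5])
```

A buffer that spans several files' worth of windows is what breaks up the per-file grouping; a buffer of size 1 degenerates to the original order.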

Reducing / Enhancing known features in an image

I am a microbiology student new to computer vision, so any help will be extremely appreciated.
This question involves microscope images that I am trying to analyze. The goal I am trying to accomplish is to count bacteria in an image but I need to pre-process the image first to enhance any bacteria that are not fluorescing very brightly. I have thought about using several different techniques like enhancing the contrast or sharpening the image but it isn't exactly what I need.
I want to reduce the noise (black spaces) to 0 on the RGB scale and enhance the green areas. I originally wrote a for loop in OpenCV with threshold limits to change each pixel, but I know that there is a better way.
Here is an example that I did in Photoshop of the original image vs. what I want.
Original Image and enhanced Image.
I need to learn to do this in a Python environment so that I can automate this process. As I said, I am new, but I am familiar with Python's OpenCV, mahotas, numpy etc., so I am not attached to a particular package. I am also very new to these techniques, so I would appreciate it even if you just point me in the right direction.
Thanks!
You can have a look at histogram equalization. This would emphasize the green and reduce the black range. There is an OpenCV tutorial here. Afterwards you can experiment with different thresholding mechanisms to find the one that best isolates the bacteria.
Use TensorFlow:
create your own dataset with images of bacteria and their positions stored in accompanying text files (the bigger the dataset the better).
Create a positive and negative set of images
update default TensorFlow example with your images
make sure you have a bunch of convolution layers.
train and test.
TensorFlow is perfect for such tasks and you don't need to worry about different intensity levels.
I initially tried histogram equalization but did not get the desired results. So I used adaptive threshold using the mean filter:
th = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY_INV, 3, 2)
Then I applied the median filter:
median = cv2.medianBlur(th, 5)
Finally I applied morphological closing with the ellipse kernel:
k1 = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,(5,5))
dilate = cv2.morphologyEx(median, cv2.MORPH_CLOSE, k1, iterations=3)
THIS PAGE will help you modify this result however you want.
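The per-pixel for loop mentioned in the question can be replaced by vectorized masking in NumPy. This is only a sketch of that idea, not a tuned pipeline: enhance_green is a hypothetical helper, the threshold and gain values are arbitrary, and the channel order is assumed to be BGR as returned by cv2.imread.

```python
import numpy as np


def enhance_green(img_bgr, threshold=40, gain=2.0):
    # Hypothetical helper: zero out dark pixels, amplify the green channel.
    img = img_bgr.astype(np.float32)
    green = img[:, :, 1]  # index 1 is green in both BGR and RGB

    # Pixels whose green value is below the threshold count as background.
    mask = green >= threshold

    out = np.zeros_like(img)
    # Amplify green where the mask is set, clip to the valid 8-bit range.
    out[:, :, 1] = np.where(mask, np.clip(green * gain, 0, 255), 0)
    return out.astype(np.uint8)


# Tiny synthetic image: one dark pixel, two green pixels, one black pixel.
demo = np.zeros((2, 2, 3), dtype=np.uint8)
demo[0, 0] = (5, 10, 5)
demo[0, 1] = (0, 60, 0)
demo[1, 0] = (0, 200, 0)

result = enhance_green(demo)
print(result[:, :, 1])
```

The same boolean mask can then be passed on to a thresholding or connected-component step for the actual counting.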
