Resized copy of Pytorch Tensor/Dataset - pytorch

I have a homemade dataset with a few million rows. I am trying to make truncated copies. So I clip the tensors that I'm using to make the original dataset and create a new dataset. However, when I save the new dataset, which is only 20K rows, it's the same size on disk as the original dataset. Otherwise everything seems kosher, including, when I check, the size of the new tensors. What am I doing wrong?
#original dataset - 2+million rows
dataset = D.TensorDataset(training_data, labels)
torch.save(dataset, filename)
#20k dataset for experiments
d = torch.Tensor(training_data[0:20000])
l = torch.Tensor(labels[0:20000])
ds_small = D.TensorDataset(d,l)
#this is the same size as the one above on disk... approx 1.45GB
torch.save(ds_small, filename_small)
Thanks

In your code, d and training_data share the same underlying storage: slicing a tensor returns a view, not a copy, and torch.save serializes the entire storage that backs a tensor, so the "small" dataset still writes out all 2+ million rows. Use clone() to get an independent copy:
d = training_data[0:20000].clone()
l = labels[0:20000].clone()
clone() will give you tensors whose memory is independent of the original tensors', and the saved file will be much smaller.
Note that wrapping the slices in torch.Tensor() is not necessary when creating d and l, since training_data and labels are already tensors.
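Putting the answer together with the variables from the question, a corrected snippet might look like this (a minimal sketch, assuming training_data, labels and filename_small are defined as in the question and that D is torch.utils.data):
import torch
import torch.utils.data as D
# clone the slices so only the 20k rows are backed by their own storage
d = training_data[0:20000].clone()
l = labels[0:20000].clone()
ds_small = D.TensorDataset(d, l)
torch.save(ds_small, filename_small)  # the saved file now contains only the 20k rows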

Related

Flat-field correction on hyperspectral data

I am working on a hyperspectral data set using the spectral Python library. I started using Python for the first time on Monday, so everything is taking me a long time.
My data is in ENVI format, and I believe I have successfully read it in and converted it to NumPy arrays.
I am attempting a flat-field correction using this code:
corrected_nparr = np.divide(np.subtract(data_nparr, dark_nparr), np.subtract(white_nparr, dark_nparr))
which fails with:
ValueError: operands could not be broadcast together with shapes (1367,384,288) (100,384,288)
This doesn't work because my white reference and dark reference are a different size from the data capture:
print(white_nparr.shape)
(297, 384, 288)
print(dark_nparr.shape)
(100, 384, 288)
print(data_nparr.shape)
(1367, 384, 288)
So, I understand why I am getting the error: the original white and dark references were captured with different image sizes from the dataset. My problem is creating a correction for the dataset while only having access to references of a different size.
Has anyone handled this before? What approach did you use?
By the way, the data I am using is mineral hyperspectral data captured from drill core; there is a huge dataset held by Geological Survey Ireland that is free upon request.
So, I received an extremely helpful answer, which actually sparked a further question.
# created these files to broadcast as they are a horizontal line of spectra,
#a 2D array which captures the variation
white_nparr_horiz = white_nparr[-2]
dark_nparr_horiz = dark_nparr[-2]
corrected_nparr = np.divide(np.subtract(data_nparr, dark_nparr_horiz), np.subtract(white_nparr_horiz, dark_nparr_horiz))
white_nparr_horiz.shape
Out[28]: (384, 288)
dark_nparr_horiz.shape
Out[29]: (384, 288)
So the shapes of these arrays are broadcastable across data_nparr, and I have tested (on a few different indices) that it works as I expect, and it does.
a = white_nparr_horiz[150, 144]
b = dark_nparr_horiz[150, 144]
c = data_nparr[500, 150, 144]
d = (c - b)/(a-b)
test = d == corrected_nparr[500, 150, 144]
print(test)
The output from this looks much more as I would expect reflectance data for this material to look, so I believe I am on the right path.
What I would like to do now is have white_nparr_horiz be the mean of each band along the original first axis of the white reference (shape (297, 384, 288)), returned as an array of shape (384, 288), as opposed to the single slice I believe it is now. I am sure this is possible, but I cannot figure out how.
As I said above, I am very new to Python, NumPy and image analysis, so apologies if this is obvious or I am going in the wrong direction.
The problem is that your white and dark references should each be a single spectrum (1D array with 288 values), whereas yours are both 3-dimensional arrays (likely corresponding to image regions). To convert them to 1D, you can compute the mean, max, or min of each array, as appropriate. For example, to take the min of the dark reference and max of the white reference, you could convert them as follows:
dark_nparr = np.min(dark_nparr.reshape(-1, dark_nparr.shape[-1]), axis=0)
white_nparr = np.max(white_nparr.reshape(-1, white_nparr.shape[-1]), axis=0)
The lines above reshape the arrays to 2 dimensions and compute the max (or min) of the reshaped arrays.
If you prefer to use the spectral mean of each array instead, just replace np.max and np.min above with np.mean.
If you instead want each array to be reduced over just its first dimension (i.e., to have shape (384, 288)), then simply don't reshape the arrays when doing the reduction:
dark_nparr = np.min(dark_nparr, axis=0)
white_nparr = np.max(white_nparr, axis=0)
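To answer the follow-up directly: reducing over only the first axis gives a per-pixel (384, 288) reference that still broadcasts against the (1367, 384, 288) data cube. A minimal sketch, assuming white_nparr, dark_nparr and data_nparr are loaded as in the question:
import numpy as np
white_ref = np.mean(white_nparr, axis=0)  # (297, 384, 288) -> (384, 288), mean over the first axis
dark_ref = np.mean(dark_nparr, axis=0)    # (100, 384, 288) -> (384, 288)
corrected_nparr = (data_nparr - dark_ref) / (white_ref - dark_ref)  # broadcasts over the first axis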

memory issues for sparse one hot encoded features

I want to create a sparse matrix for the one-hot encoded features from the data frame df, but I am running into a memory error with the code below. The shape of sparse_onehot is (450138, 1508).
sp_features = ['id', 'video_id', 'genre']
sparse_onehot = pd.get_dummies(df[sp_features], columns = sp_features)
import scipy
X = scipy.sparse.csr_matrix(sparse_onehot.values)
I get the memory error shown below:
MemoryError: Unable to allocate 647. MiB for an array with shape (1508, 450138) and data type uint8
I have tried scipy.sparse.lil_matrix and get the same error as above.
Is there any efficient way of handling this?
Thanks in advance
Try setting the sparse parameter to True:
sparse : bool, default False
Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
sparse_onehot = pd.get_dummies(df[sp_features], columns = sp_features, sparse = True)
This will use a much more memory-efficient (but somewhat slower) representation than the default dense one.
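If you still need a scipy CSR matrix downstream, the sparse-backed frame can be converted without materializing a dense array. A minimal sketch, assuming df and sp_features are as in the question:
import pandas as pd
import scipy.sparse
sparse_onehot = pd.get_dummies(df[sp_features], columns=sp_features, sparse=True)
# DataFrame.sparse.to_coo() works because every dummy column is sparse here
X = sparse_onehot.sparse.to_coo().tocsr()  # csr_matrix built without a dense intermediate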

randomly shuffle 3d numpy array on specific axis

I have loaded 66 csv files, each containing a time series of 5000 time steps and three features with the following code:
rms = glob(sims_dir+"/rms*.csv")
df = [pd.read_csv(f).values for f in rms]
data = np.asarray(df)
data.shape
(66, 5004, 3)
Here my first axis of size 66 corresponds to my 66 unique time series. I would like to shuffle the array so that the first dimension (66) is in random order, but I am not exactly sure of the best way to do this. An alternative approach could be to load each csv in random order from its directory, but I was wondering how this could be achieved in NumPy.
Use np.random.shuffle(data). This function shuffles an array in place along its first axis only, which is exactly what you want here; see the np.random.shuffle documentation.
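A minimal sketch of both options, assuming data has shape (66, 5004, 3) as above:
import numpy as np
np.random.shuffle(data)            # in-place shuffle along the first axis only
# or, keeping the original array intact:
rng = np.random.default_rng()
perm = rng.permutation(data.shape[0])
shuffled = data[perm]              # rows of axis 0 in random order, data unchanged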

Custom Lambert Conformal projection of 2D data gives very large file size when saved as pdf (16MB)

I'm trying to save a figure for use in my thesis, but the file comes out at 16 MB (figures made with PlateCarree projections are on the order of 100-200 kB).
I have provided the code, any tips, tricks and otherwise would be greatly appreciated.
SUM is loaded from a netCDF file and has coordinates with len(y) = 949 and len(x) = 889.
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
projection = ccrs.LambertConformal(central_latitude=63, central_longitude=15, standard_parallels=[63, 63], cutoff=-30)
fig = plt.figure(figsize = (12,10))
ax = plt.subplot(projection=projection)
SUM.plot.pcolormesh(ax=ax,cbar_kwargs = {"label":"Temperature - [C]"})
ax.set_title("Mean Temperature - 750m height")
ax.coastlines("50m")
fig.savefig(outsource+"MeanT750.pdf")
I have tried passing the dpi argument to savefig, but it does not change the file size.
In your case the PDF file contains vector data, which is why the dpi argument has no effect. You can reduce the level of detail to decrease the file size. For example:
ax.coastlines("110m")
will use the coarser 110m coastline resolution instead of 50m, saving a substantial amount of vector data while still giving a good visualization on the map.
For pcolormesh(), the file size scales with the number of cells (rows × columns) in your meshed data, since each cell is written to the PDF as a filled polygon. If you reduce the number of rows/columns, you reduce the size.
However, if contourf() can be used instead of pcolormesh(), you may save much more, because filled contours are stored as a few large polygons rather than one polygon per grid cell.
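A minimal sketch combining both suggestions (coarser coastlines and contourf instead of pcolormesh), assuming SUM, projection and outsource are defined as in the question; levels=20 is an arbitrary choice:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12, 10))
ax = plt.subplot(projection=projection)
# filled contours are written as a few polygons instead of one quad per grid cell
SUM.plot.contourf(ax=ax, levels=20, cbar_kwargs={"label": "Temperature - [C]"})
ax.set_title("Mean Temperature - 750m height")
ax.coastlines("110m")  # coarser coastline resolution
fig.savefig(outsource + "MeanT750.pdf")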

Why is my notebook crashing when I run this for loop and what is the fix?

I have taken code relating to the Kalman filter and am attempting to iterate through each column of data. What I would like to happen is:
The column data is fed into the filter
The filtered column data (xhat) is placed into another DataFrame (filtered)
The filtered column data (xhat) is used to produce a visual.
I have created a for loop to iterate through the columns, but when I run the cell, the notebook crashes. When it doesn't crash, I get this warning:
C:\Users\perso\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:45: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
Thanks in advance for any help. I hope this question is detailed enough. I bombed on the last one.
'''A Python implementation of the example given in pages 11-15 of "An
Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
University of North Carolina at Chapel Hill, Department of Computer
Science, TR 95-041,
https://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf'''
# by Andrew D. Straw
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# dataframe created to hold filtered data
filtered = pd.DataFrame()

# initial parameters
for column in data:
    n_iter = len(data.index)  # number of iterations equal to sample numbers
    sz = (n_iter,)            # size of array
    z = data[column]          # observations
    Q = 1e-5                  # process variance

    # allocate space for arrays
    xhat = np.zeros(sz)       # a posteriori estimate of x
    P = np.zeros(sz)          # a posteriori error estimate
    xhatminus = np.zeros(sz)  # a priori estimate of x
    Pminus = np.zeros(sz)     # a priori error estimate
    K = np.zeros(sz)          # gain or blending factor
    R = 1.0**2                # estimate of measurement variance, change to see effect

    # initial guesses
    xhat[0] = z[0]
    P[0] = 1.0

    for k in range(1, n_iter):
        # time update
        xhatminus[k] = xhat[k-1]
        Pminus[k] = P[k-1] + Q

        # measurement update
        K[k] = Pminus[k] / (Pminus[k] + R)
        xhat[k] = xhatminus[k] + K[k] * (z[k] - xhatminus[k])
        P[k] = (1 - K[k]) * Pminus[k]

    # add new data to created dataframe
    filtered.assign(a=[xhat])

    # create visualization of noise reduction
    plt.rcParams['figure.figsize'] = (10, 8)
    plt.figure()
    plt.plot(z, 'k+', label='noisy measurements')
    plt.plot(xhat, 'b-', label='a posteriori estimate')
    plt.legend()
    plt.title('Estimate vs. iteration step', fontweight='bold')
    plt.xlabel('column data')
    plt.ylabel('Measurement')
This seems like a pretty straightforward error. The warning indicates that you have opened more figures than the limit at which a warning is raised (a parameter you can change, set to 20 by default). This is because each iteration of your for loop creates a new figure. Depending on how many columns data has, you may be opening dozens or hundreds of figures. Each figure takes resources to generate and keep open, so you are putting a very large load on your system: either it processes very slowly or it crashes altogether. In any case, the solution is to keep fewer figures open.
I don't know exactly what you're plotting in your loop, but each iteration of the outer loop corresponds to one column of your data, and at each iteration you create a brand-new figure for that column's measured and estimated values. In that case, either define a single figure (and its options) once, outside of the loop, rather than at each iteration, or close each figure with plt.close() once it has been saved or shown. A better way is probably to generate all of the data you want to plot ahead of time, store it in an easy-to-plot datatype such as a list or the filtered DataFrame, and plot it once at the end.
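A minimal sketch of the explicit-close variant, assuming data is the DataFrame of raw columns; run_kalman is a hypothetical helper standing in for the filter loop in the question:
import matplotlib.pyplot as plt
import pandas as pd
filtered = pd.DataFrame()
for column in data:
    z = data[column]
    xhat = run_kalman(z)          # hypothetical helper wrapping the Kalman update loop above
    filtered[column] = xhat       # actually store the filtered column (assign() alone does not modify in place)
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(z, 'k+', label='noisy measurements')
    ax.plot(xhat, 'b-', label='a posteriori estimate')
    ax.legend()
    fig.savefig(f"kalman_{column}.png")
    plt.close(fig)                # release the figure so matplotlib does not keep all of them open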
