randomly shuffle 3d numpy array on specific axis - python-3.x

I have loaded 66 csv files, each containing a time series of 5000 time steps and three features with the following code:
rms = glob(sims_dir+"/rms*.csv")
df = [pd.read_csv(f).values for f in rms]
data = np.asarray(df)
data.shape
(66, 5004, 3)
Here the first axis of size 66 holds my 66 unique time series. I would like to shuffle the array so that the first dimension (66) is in random order, but I'm not exactly sure of the best way to do this. An alternative approach could be to load each csv from its directory in random order, but I was wondering how this could be achieved in numpy.

Use np.random.shuffle(data). This function shuffles an array in place along its first axis only, which is exactly what you want here. See the documentation for np.random.shuffle.
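For example, a minimal sketch assuming data has the shape (66, 5004, 3) from the question:
import numpy as np
# Option 1: shuffle in place along the first axis
np.random.shuffle(data)                 # the 66 series are now in random order
# Option 2: build a permutation and index with it, keeping the original intact
rng = np.random.default_rng()
perm = rng.permutation(data.shape[0])   # random ordering of 0..65
shuffled = data[perm]                   # new array; data is unchanged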

Related

Why is ColumnTransformer producing a different output using the same code but different .csv files?

I am trying to finish this course tooth and nail with the hope of being able to do this kind of work at entry level by springtime. This is my first post here on this incredible resource, and I will do my best to conform to the posting format. As a way to reinforce my learning and commit it to long-term memory, I'm trying the same things on my own dataset of > 500 entries containing data more relevant to me, as opposed to dummy data.
I'm learning about the data preprocessing phase where you fill in missing values and separate the columns into their respective X and Y to be fed into the models later on, if I understand correctly.
So in the course example, it's the top-left dataset of countries. The bottom left is my own database of data I've been keeping for about a year on a multiplayer game I play. It has 100 or so characters you can choose from, who are played across 5 different categorical roles.
[Screenshots: course dataset (top left), personal dataset (bottom left), and the column-transformed results of the personal dataset]
What's up with the different outputs, when the only difference is the dataset (.csv file)? The course's dataset looks right: the first column of countries (textual categories) gets turned into binary vectors in the output. Why is the output on my dataset omitting columns and producing these bizarre-looking tuples followed by what looks like a random number? I've tried removing the np.array function, and I've tried printing each output at each level, but I'm unable to see what's causing the difference. I expected that on my dataset it would transform the characters' names into binary vectors (combinations of 1s/0s) so the computer can understand the difference and map them to the appropriate results. Instead I'm getting that weird-looking output I've never seen before.
EDIT: It turns out these bizarre number combinations are what's called a "sparse matrix." I had to do some research, starting with type(), which yielded csr_array. If I understood what I read correctly, all the stuff inside takes up one column, so I just tried all rows/columns using [:] and I didn't get an error.
Really appreciate your time and assistance.
EDIT: Thanks to this thread I was able to make my way to the end of this data preprocessing/import/cleaning phase of the exercise, through feature scaling, using my own dataset of ~550 rows.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
# IMPORT RAW DATA // ASSIGN X AND Y RAW
df = pd.read_csv('datasets/winpredictor.csv')
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
# TRANSFORM CATEGORICAL DATA
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0, 1])],
    remainder='passthrough')
le = LabelEncoder()
X = ct.fit_transform(X)
y = le.fit_transform(y)
# SPLIT THE DATA INTO TRAINING AND TEST SETS
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=1)
# FEATURE SCALING
sc = StandardScaler(with_mean=False)
X_train[:, :] = sc.fit_transform(X_train[:, :])
X_test[:, :] = sc.transform(X_test[:, :])
First of all, I encourage you to keep working through this course, and for sure you will be a solid Data Scientist in a few weeks.
Let's talk about your problem. It seems you only have a visualization issue, caused by the large number of distinct "Hero" values (I think you have 37 unique values).
I will explain the results you have plotted. The printout only shows the entries of each sample that are different from 0:
(0, 10) = 1 --> 0 refers to the first sample, and 10 means the value at column 10 of that sample is equal to 1.
(0, 37) = 5 --> 0 again refers to the first sample, and 37 means the value at column 37 is equal to 5.
etc.
So your first sample will be something like:
[0,0,0,0,0,0,0,0,0,0,1,.........., 5, 980,-30, 1000, 6023]
which is the sparse representation of the first sample, "Jakiro":
["Jakiro",5, 980,-30, 1000, 6023]
To sum up, the first 37 values come from your OneHotEncoder, and the last 5 are your original numerical values.
So the result seems correct; it's just a different way of printing it, due to the large number of classes in the categorical variable.
You can try reducing the number of rows of X (to 4, for example) and running the same process. Then you will get output similar to the course's.
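If you want to see the familiar dense 0/1 columns instead of the sparse printout, you can also convert the result; this is a minimal sketch, and note that the dense-encoder parameter name depends on your scikit-learn version (sparse_output in 1.2+, sparse before that):
# Densify the sparse ColumnTransformer output just for inspection
X_dense = X.toarray() if hasattr(X, "toarray") else X
print(X_dense[:4])   # first four rows as ordinary 0/1 and numeric columns
# Alternatively, ask the encoder not to produce a sparse matrix at all
# (use sparse=False instead of sparse_output=False on scikit-learn < 1.2)
ct_dense = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(sparse_output=False), [0, 1])],
    remainder='passthrough')
X_dense2 = ct_dense.fit_transform(df.iloc[:, :-1].values)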

Appending numpy arrays of differing sizes when you don't know what maximum size you need?

I'm crawling across a folder of WAV files, with each file having the same sample-rate but different lengths. I'm loading these using Librosa and computing a range of spectral features on them. This results in arrays of different sizes due to the differing durations. Trying to then concatenate all of these arrays fails - obviously because of their different shapes, for example:
shape(1,2046)
shape(1,304)
shape(1,154)
So what I've done is, before loading the files, use librosa to get the duration of each file and pack the durations into a list.
class GetDurations:
    def __init__(self, files, samplerate):
        durations = []  # renamed from `list` to avoid shadowing the built-in
        self.files = files
        self.sampleRate = samplerate
        for file in self.files:
            durations.append(librosa.get_duration(filename=file, sr=44100))
        self.maxFileDuration = np.max(durations)
Then I take the maximum value of the list, to get the maximum possible length of my array, and convert it to frames (which is what the spectral extraction features of Librosa work with).
self.maxDurationInFrames = librosa.time_to_frames(
    self.getDur.maxFileDuration, sr=44100, hop_length=512) + 1
So now I've got a value that I know will account for the longest duration of my input files. I just need to initialise my array with this length.
allSpectralCentroid = np.zeros((1, self.maxDurationInFrames))[1:]
This gives me an empty container for all of my extracted spectral centroid data for all WAV files in the directory. In order to add data to this array I later on do the following:
padValue = allSpectralCentroid.shape[1] - workingSpectralCentroid.shape[1]
workingSpectralCentroid = np.pad(workingSpectralCentroid[0], ((0, padValue)), mode='constant')[np.newaxis]
allSpectralCentroid = np.append(allSpectralCentroid, workingSpectralCentroid, axis=0)
This subtracts the length of the 'working' array from that of the 'all' array to get a pad value. It then pads the working array with zeros to make it the same length as the all array. Finally, it appends the two (joining them together) and assigns the result back to the 'all' variable.
So....
My question is - is there a more efficient way to do this?
Bonus question - how do I do this when I can never know the required length in advance?
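This is not from the original thread, but one common pattern for the bonus case is to collect the per-file arrays in a plain Python list, find the maximum length afterwards, and pad once at the end; a rough sketch using the naming from the question:
# Collect every spectral-centroid array first, without knowing the final length
centroids = []
for file in files:                                   # `files` as in the question
    y, sr = librosa.load(file, sr=44100)
    centroids.append(librosa.feature.spectral_centroid(y=y, sr=sr))
# Now the maximum length is simply that of the longest array collected
max_len = max(c.shape[1] for c in centroids)
allSpectralCentroid = np.vstack([
    np.pad(c[0], (0, max_len - c.shape[1]), mode='constant')
    for c in centroids
])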

Why is my notebook crashing when I run this for loop and what is the fix?

I have taken code in relation to the Kalman Filter and am attempting to iterate through each column of data. What I would like to have happen is:
The column data is fed into the filter
The filtered column data (xhat) is placed into another DataFrame (filtered)
The filtered column data (xhat) is used to produce a visual.
I have created a for loop to iterate through the column data, but running the cell crashes the notebook. When it doesn't crash, I get this warning:
C:\Users\perso\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:45: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
Thanks in advance for any help. I hope this question is detailed enough. I bombed on the last one.
'''A Python implementation of the example given in pages 11-15 of "An
Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
University of North Carolina at Chapel Hill, Department of Computer
Science, TR 95-041,
https://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf'''
# by Andrew D. Straw
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# dataframe created to hold filtered data
filtered = pd.DataFrame()
# initial parameters
for column in data:
    n_iter = len(data.index)   # number of iterations equal to sample count
    sz = (n_iter,)             # size of array
    z = data[column]           # observations
    Q = 1e-5                   # process variance
    # allocate space for arrays
    xhat = np.zeros(sz)        # a posteriori estimate of x
    P = np.zeros(sz)           # a posteriori error estimate
    xhatminus = np.zeros(sz)   # a priori estimate of x
    Pminus = np.zeros(sz)      # a priori error estimate
    K = np.zeros(sz)           # gain or blending factor
    R = 1.0**2                 # estimate of measurement variance, change to see effect
    # initial guesses
    xhat[0] = z[0]
    P[0] = 1.0
    for k in range(1, n_iter):
        # time update
        xhatminus[k] = xhat[k-1]
        Pminus[k] = P[k-1] + Q
        # measurement update
        K[k] = Pminus[k] / (Pminus[k] + R)
        xhat[k] = xhatminus[k] + K[k] * (z[k] - xhatminus[k])
        P[k] = (1 - K[k]) * Pminus[k]
    # store this column's filtered series; note that DataFrame.assign() returns
    # a new frame and does not modify `filtered` in place
    filtered[column] = xhat
    # create visualization of noise reduction
    plt.rcParams['figure.figsize'] = (10, 8)
    plt.figure()
    plt.plot(z, 'k+', label='noisy measurements')
    plt.plot(xhat, 'b-', label='a posteriori estimate')
    plt.legend()
    plt.title('Estimate vs. iteration step', fontweight='bold')
    plt.xlabel('column data')
    plt.ylabel('Measurement')
This seems like a pretty straightforward error. The warning indicates that you have opened more figures than the limit at which a warning is raised (a parameter you can change, set to 20 by default). This is because each iteration of your outer for loop, i.e. each column of data, creates a new figure. Depending on how many columns you have, you could be opening dozens or hundreds of figures. Each of these figures takes resources to generate and show, so you are putting a very large resource load on your system; either it is processing very slowly or it is crashing altogether. In any case, the solution is to keep fewer figures open.
It seems each iteration of your loop corresponds to one column, and for each column you'd like to plot the estimated and actual values. In that case you can either define a single figure (and its options) once, outside of the loop, and reuse it, or close each figure once you are done with it. A better way is probably to generate all of the data you want to plot ahead of time, store it in an easy-to-plot structure such as your filtered DataFrame, and plot it once at the end.
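Not from the original answer, but as a concrete illustration: one way to keep the per-column plots without exhausting memory is to close each figure after rendering or saving it (the savefig filename here is just illustrative):
import matplotlib.pyplot as plt
for column in data:
    # ... run the Kalman filter for this column to obtain z and xhat ...
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.plot(z, 'k+', label='noisy measurements')
    ax.plot(xhat, 'b-', label='a posteriori estimate')
    ax.legend()
    ax.set_title(f'Estimate vs. iteration step ({column})', fontweight='bold')
    fig.savefig(f'{column}_filtered.png')   # or plt.show() for a handful of columns
    plt.close(fig)                           # release the figure so it doesn't pile up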

Resized copy of Pytorch Tensor/Dataset

I have a homemade dataset with a few million rows. I am trying to make truncated copies, so I clip the tensors that I'm using to make the original dataset and create a new dataset. However, when I save the new dataset, which is only 20K rows, it's the same size on disk as the original dataset. Otherwise everything seems kosher, including the size of the new tensors when I check them. What am I doing wrong?
#original dataset - 2+million rows
dataset = D.TensorDataset(training_data, labels)
torch.save(dataset, filename)
#20k dataset for experiments
d = torch.Tensor(training_data[0:20000])
l = torch.Tensor(labels[0:20000])
ds_small = D.TensorDataset(d,l)
#this is the same size as the one above on disk... approx 1.45GB
torch.save(ds_small, filename_small)
Thanks
In your code, d and training_data share the same underlying memory, even though you use slicing when creating d: a slice is a view into the original tensor's storage, and torch.save writes out that whole storage. To get a truly independent copy, use clone():
d = training_data[0:20000].clone()
l = labels[0:20000].clone()
clone() will give you tensors whose memory is independent of the old tensors', and the saved file will be much smaller.
Note that using torch.Tensor() is not necessary when creating d and l, since training_data and labels are already tensors.
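As a quick sanity check (not part of the original answer), you can confirm whether two tensors share storage by comparing their data pointers:
# A plain slice still points at the original buffer, while .clone() does not
d_view  = training_data[0:20000]            # shares memory with training_data
d_clone = training_data[0:20000].clone()    # independent copy
print(d_view.data_ptr() == training_data.data_ptr())    # True  -> same underlying storage
print(d_clone.data_ptr() == training_data.data_ptr())   # False -> freshly allocated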

VTK - How to use vtkNetCDFCFReader to read an array or variable array at specific time frame

I'm trying to load an array at a specific time frame (for example, if it has 50 frames or time units, then get the array corresponding to the 2nd time frame) from netCDF files (.nc). I'm currently using vtkNetCDFCFReader and getting the data array "vwnd" from the 1st time frame like this:
vtkSmartPointer<vtkNetCDFCFReader> reader = vtkSmartPointer<vtkNetCDFCFReader>::New();
reader->SetFileName(path.c_str());
reader->UpdateMetaData();
vtkSmartPointer<vtkStructuredGridGeometryFilter> geometryFilter = vtkSmartPointer<vtkStructuredGridGeometryFilter>::New();
geometryFilter->SetInputConnection(reader->GetOutputPort());
geometryFilter->Update();
vtkSmartPointer<vtkPolyData> ncPolydata = vtkSmartPointer<vtkPolyData>::New();
ncPolydata = geometryFilter->GetOutput();
vtkSmartPointer<vtkDataArray> dataArray = ncPolydata->GetCellData()->GetArray("vwnd");
The variable arrays are: lat, lon, time, vwnd (vwnd has dimensions (lat, lon)). I'm also interested in getting the arrays for lat and lon. Any help would be appreciated.
Thanks in advance
As the dimensions of lat/lon are different from those of vwnd, you will need two vtkNetCDFReaders to read in data with different dimensions. Just remember to set the dimensions after creating each reader.
For example in C++:
vtkNetCDFReader* reader = vtkNetCDFReader::New();
reader->SetFileName(fileName.c_str());
reader->UpdateMetaData();
//here you specify the dimension of the reader
reader->SetDimension(dim);
reader->SetVariableArrayStatus("lat", 1);
reader->SetVariableArrayStatus("lon", 1);
reader->Update();
If this is set up correctly, you can read in any of the arrays and store them in a vtkDataArray.
If you want to read the vwnd data at the second time step, just skip the first lat*lon values.
