According to a paper, this is supposed to work.
But as a learner of the scikit-learn package, I do not see how.
All the sample codes cluster by ellipses or circles, as here.
I would really like to know how to cluster the following plot by different patterns. Features 0-3 are the mean power over certain time periods (the year divided into 4), while features 4, 5 and 6 correspond to the standard deviation over the year, the weekday/weekend variance and the winter/summer variance respectively. So the y-label does not necessarily apply to features 4, 5 and 6.
Following the sample code, BIC did indicate that the optimal number of clusters is 5.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture  # GMM was removed from scikit-learn; GaussianMixture is its replacement

n_components = np.arange(1, 21)
models = [GaussianMixture(n, covariance_type='full', random_state=0).fit(input)
          for n in n_components]
plt.plot(n_components, [m.bic(input) for m in models], label='BIC')
plt.legend(loc='best')
plt.xlabel('n_components')
If I plot it with the available sample code, however, it returns something completely weird, not worth sharing. I thought a negative BIC was OK, but I don't even know whether it clustered correctly enough to deduce that 5 is the optimal number.
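For reference, here is a minimal way to read the optimum off the BIC values computed above, instead of eyeballing the plot (this assumes the models and n_components from the snippet above):

bic_values = np.array([m.bic(input) for m in models])
best_n = n_components[np.argmin(bic_values)]   # the lowest BIC marks the preferred model
print(best_n)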
Basically, in an effort to close this question, the following shows how I ended up clustering with a GMM.
Create a model using the appropriate parameters:
gmm = GaussianMixture(n_components=10, covariance_type ='full', \
init_params = 'random', max_iter = 100, random_state=0)
Fit your data (an array of shape number of samples x number of attributes), which is called input in my case:
gmm.fit(input)
print(gmm.means_.round(2))
cluster = gmm.predict(input)
cluster now contains the label assigned to each sample of my input.
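As a small illustration of using those labels (a sketch, assuming input is a 2-D NumPy array; which two features you plot is up to you):

import numpy as np
import matplotlib.pyplot as plt

print(np.bincount(cluster))   # how many samples ended up in each cluster

# colour the first two features by cluster label
plt.scatter(input[:, 0], input[:, 1], c=cluster, s=5, cmap='tab10')
plt.xlabel('feature 0')
plt.ylabel('feature 1')
plt.show()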
Feel free to add to this if I've gotten anything wrong.
Related
I am working on a hyperspectral data set using the Spectral Python library. I started using Python for the first time on Monday, so everything is taking me a long time.
My data is in ENVI format, and I believe I have successfully read it in and converted it to numpy arrays.
I am attempting a flat-field correction using this code:
corrected_nparr = np.divide(np.subtract(data_nparr, dark_nparr), np.subtract(white_nparr, dark_nparr))
ValueError: operands could not be broadcast together with shapes (1367,384,288) (100,384,288)
This doesn't work because my white reference and dark reference are a different size to the data capture.
print(white_nparr.shape)
(297, 384, 288)
print(dark_nparr.shape)
(100, 384, 288)
print(data_nparr.shape)
(1367, 384, 288)
So I understand why I am getting the error: the original white and dark references were captured using different image sizes to the dataset. My problem is therefore creating a correction for the dataset whilst only having access to references of a different size.
Has anyone handled this before? What approach did you use?
By the way, the data I am using is mineral hyperspectral data captured from drill core; there is a huge dataset held by Geological Survey Ireland which is free upon request.
So, I received an extremely helpful answer, which actually sparked a further question.
# created these arrays so they broadcast against the data: each is a horizontal line of spectra,
# i.e. a 2D array of shape (384, 288) which captures the variation
white_nparr_horiz = white_nparr[-2]
dark_nparr_horiz = dark_nparr[-2]
corrected_nparr = np.divide(np.subtract(data_nparr, dark_nparr_horiz), np.subtract(white_nparr_horiz, dark_nparr_horiz))
white_nparr_horiz.shape
Out[28]: (384, 288)
dark_nparr_horiz.shape
Out[29]: (384, 288)
So these arrays are broadcastable across the data array, and I have tested (on a few different indices) that the correction works as I expect, which it does.
a = white_nparr_horiz[150, 144]
b = dark_nparr_horiz[150, 144]
c = data_nparr[500, 150, 144]
d = (c - b)/(a-b)
test = d == corrected_nparr[500, 150, 144]
print(test)
The output from this looks much more like I would expect reflectance data for this material to look, so I believe I am on the right path.
What I would like to do now is have white_nparr_horiz be the mean of each band along the original first axis of the white reference (297, 384, 288), returned as an array of shape (384, 288), as opposed to the single frame it is now. I am sure this is possible, but I cannot figure out how.
As I said above, very new to python, numpy and image analysis, so apologies if this is obvious or I am going in the wrong direction
The problem is that your white and dark references should each be a single spectrum (1D array with 288 values), whereas yours are both 3-dimensional arrays (likely corresponding to image regions). To convert them to 1D, you can compute the mean, max, or min of each array, as appropriate. For example, to take the min of the dark reference and max of the white reference, you could convert them as follows:
dark_nparr = np.min(dark_nparr.reshape(-1, dark_nparr.shape[-1]), axis=0)
white_nparr = np.max(white_nparr.reshape(-1, white_nparr.shape[-1]), axis=0)
The lines above reshape the arrays to 2 dimensions and compute the max (or min) of the reshaped arrays.
If you prefer to use the spectral mean of each array instead, just replace np.max and np.min above with np.mean.
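For example, the mean-based versions would be:

dark_nparr = np.mean(dark_nparr.reshape(-1, dark_nparr.shape[-1]), axis=0)
white_nparr = np.mean(white_nparr.reshape(-1, white_nparr.shape[-1]), axis=0)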
If you want each array to just be reduced over its first dimension (i.e., to have shape (384, 288), as asked in the follow-up), then just don't reshape the arrays when doing the reduction; again, swap in np.mean if you want the average:
dark_nparr = np.min(dark_nparr, axis=0)
white_nparr = np.max(white_nparr, axis=0)
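Putting that together for the follow-up question, a short sketch using the variable names from the question: take the mean over the first axis of each reference and let NumPy broadcasting apply the 2D references across every frame of the data cube.

white_mean = np.mean(white_nparr, axis=0)   # (297, 384, 288) -> (384, 288)
dark_mean = np.mean(dark_nparr, axis=0)     # (100, 384, 288) -> (384, 288)

# the 2D references broadcast over the first axis of the (1367, 384, 288) data cube
corrected_nparr = (data_nparr - dark_mean) / (white_mean - dark_mean)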
I have taken some code relating to the Kalman filter and am attempting to iterate through each column of my data. What I would like to happen is:
The column data is fed into the filter
The filtered column data (xhat) is placed into another DataFrame (filtered)
The filtered column data (xhat) is used to produce a visual.
I have created a for loop to iterate through the columns, but when I run the cell the notebook crashes. When it doesn't crash, I get this warning:
C:\Users\perso\Anaconda3\envs\learn-env\lib\site-packages\ipykernel_launcher.py:45: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
Thanks in advance for any help. I hope this question is detailed enough. I bombed on the last one.
'''A Python implementation of the example given in pages 11-15 of "An
Introduction to the Kalman Filter" by Greg Welch and Gary Bishop,
University of North Carolina at Chapel Hill, Department of Computer
Science, TR 95-041,
https://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf'''
# by Andrew D. Straw
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# dataframe created to hold filtered data
filtered = pd.DataFrame()

# initial parameters
for column in data:
    n_iter = len(data.index)  # number of iterations equal to the number of samples
    sz = (n_iter,)            # size of array
    z = data[column]          # observations
    Q = 1e-5                  # process variance

    # allocate space for arrays
    xhat = np.zeros(sz)       # a posteriori estimate of x
    P = np.zeros(sz)          # a posteriori error estimate
    xhatminus = np.zeros(sz)  # a priori estimate of x
    Pminus = np.zeros(sz)     # a priori error estimate
    K = np.zeros(sz)          # gain or blending factor
    R = 1.0**2                # estimate of measurement variance, change to see effect

    # initial guesses
    xhat[0] = z[0]
    P[0] = 1.0

    for k in range(1, n_iter):
        # time update
        xhatminus[k] = xhat[k-1]
        Pminus[k] = P[k-1] + Q

        # measurement update
        K[k] = Pminus[k] / (Pminus[k] + R)
        xhat[k] = xhatminus[k] + K[k]*(z[k] - xhatminus[k])
        P[k] = (1 - K[k])*Pminus[k]

    # add new data to created dataframe
    filtered.assign(a=[xhat])

    # create visualization of noise reduction
    plt.rcParams['figure.figsize'] = (10, 8)
    plt.figure()
    plt.plot(z, 'k+', label='noisy measurements')
    plt.plot(xhat, 'b-', label='a posteriori estimate')
    plt.legend()
    plt.title('Estimate vs. iteration step', fontweight='bold')
    plt.xlabel('column data')
    plt.ylabel('Measurement')
This seems like a pretty straightforward error. The warning indicates that you have opened more figures than the warning threshold (a parameter you can change, but which is set to 20 by default). This is because each iteration of your for loop creates a new figure, so depending on how many columns you iterate over you could be opening dozens or hundreds of figures. Each of these figures takes resources to generate and keep open, so you are putting a very large resource load on your system; either it processes very slowly as a result, or it crashes altogether. In any case, the solution is to plot fewer figures.
I don't know exactly what you're plotting in your loop, but it seems each iteration corresponds to one column, and for each column you'd like to compare the estimated and measured values. In that case, define a figure and its options once, outside of the loop, rather than at each iteration. An even better way is probably to generate all of the data you want to plot ahead of time, store it in an easy-to-plot data type (lists, arrays, or your filtered DataFrame), and then plot it once at the end.
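A minimal sketch of that last suggestion (assuming data is the same DataFrame as in the question): run the filter over every column first, collect the results in filtered, and only then create a single figure.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

filtered = pd.DataFrame(index=data.index)

for column in data:
    z = data[column].to_numpy()
    n_iter = len(z)
    Q, R = 1e-5, 1.0**2
    xhat = np.zeros(n_iter)
    P = np.zeros(n_iter)
    xhat[0], P[0] = z[0], 1.0
    for k in range(1, n_iter):
        Pminus = P[k-1] + Q                       # time update
        K = Pminus / (Pminus + R)                 # Kalman gain
        xhat[k] = xhat[k-1] + K * (z[k] - xhat[k-1])
        P[k] = (1 - K) * Pminus
    filtered[column] = xhat                       # keep the filtered series

# one figure, created after all the data has been generated
fig, ax = plt.subplots(figsize=(10, 8))
for column in data:
    ax.plot(data[column].to_numpy(), 'k+', alpha=0.3)         # noisy measurements
    ax.plot(filtered[column].to_numpy(), '-', label=column)   # a posteriori estimate
ax.set_xlabel('iteration step')
ax.set_ylabel('Measurement')
ax.legend()
plt.show()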
I want to take an input of millions of lat long points (with a numerical attribute) and then find all fixed radius geospatial clusters where the sum of the attribute within the circle is above a defined threshold.
I started by using sklearn BallTree to sum the attribute within any defined circle, with the intention of then expanding this out to run across a grid or lattice of circles. The run time for one circle is around 0.01s, so this is fine for small lattices, but won't scale if I want to run 200m radius circles across the whole of the UK.
# imports needed by the snippet below
import numpy as np
import pandas
from sklearn.neighbors import BallTree

# example data (use 2m rows from the postcode centroid file)
df = pandas.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv', usecols=[0,1], nrows=2000000)

# this will be our grid of points (or lattice); use points from the same file for the example
df2 = pandas.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv', usecols=[0,1], nrows=2000)

# reorder lat/long columns for BallTree input
columnTitles = ["Y", "X"]
df = df.reindex(columns=columnTitles)
df2 = df2.reindex(columns=columnTitles)

# assign new columns to the existing dataframes. 'attribute' holds the data we want to sum over (set to 1 for now)
df['attribute'] = 1
df2['aggregation'] = 0

RADIANT_TO_KM_CONSTANT = 6367  # approximate Earth radius in km, used to convert radians to km

class BallTreeIndex:
    def __init__(self, lat_longs):
        self.lat_longs = np.radians(lat_longs)
        self.ball_tree_index = BallTree(self.lat_longs, metric='haversine')

    def query_radius(self, query, radius):
        radius_km = radius/1000
        radius_radiant = radius_km / RADIANT_TO_KM_CONSTANT
        query = np.radians(np.array([query]))
        indices = self.ball_tree_index.query_radius(query, r=radius_radiant)
        return indices[0]

# index the base data
a = BallTreeIndex(df.iloc[:, 0:2])

# begin to loop over the lattice to test performance
for i in range(0, 100):
    b = df2.iloc[i, 0:2]
    output = a.query_radius(b, 200)
    accumulation = sum(df.iloc[output, 2])
    df2.iloc[i, 2] = accumulation
It feels as if the above code is really inefficient, as I don't need to run the calculation across all circles on my lattice (most will be well below my threshold, or will have no data points in them at all).
Instead of this for loop, is there a better way of scaling this algorithm to give me the most dense circles?
I'm new to python, so any help would be massively appreciated!!
First, don't try to do this on a sphere! Great Britain is small and has a well-defined geographic projection that will work, so use the oseast1m and osnorth1m columns as X and Y. They are in metres, so there is no need to convert (roughly) to degrees and use the haversine metric. That should help.
Next add a spatial index to speed up lookups.
If you need more speed there are various tricks, like loading a 2R-wide strip across the country into memory and running your circles across that strip, then moving down a grid step and updating the strip (checking Y values against a fixed value is quick, especially if you store the data sorted on Y then X). If you need still more speed, look at any of the papers that Stan Openshaw (and sometimes I) wrote about parallelising the GAM. There are examples of implementing GAM in Python (e.g. this paper, this paper) that may also point to better ways.
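For illustration, a rough sketch of that first suggestion using SciPy's cKDTree on the projected coordinates (this assumes the oseast1m and osnorth1m columns are present in the CSV; the threshold value is just a placeholder):

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

# read the projected coordinates (metres) instead of lat/long
pts = pd.read_csv('National_Statistics_Postcode_Lookup_Latest_Centroids.csv',
                  usecols=['oseast1m', 'osnorth1m']).dropna()
pts['attribute'] = 1

xy = pts[['oseast1m', 'osnorth1m']].to_numpy()
tree = cKDTree(xy)                       # spatial index in projected space

# lattice of candidate circle centres (here just the first 2000 points as an example)
lattice = xy[:2000]

# one vectorised query: indices of all points within 200 m of each centre
neighbours = tree.query_ball_point(lattice, r=200)

attribute = pts['attribute'].to_numpy()
sums = np.array([attribute[idx].sum() for idx in neighbours])

threshold = 50                           # placeholder threshold
dense_centres = lattice[sums > threshold]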
I have a set of images on which I performed edge detection using OpenCV 3.1. The edges are stored in OpenCV Mat objects. Can someone help me with Java SVM train and test code for that set of images?
Following the discussion in the comments, I am providing you with an example project which I built for Android Studio a while back.
It was used to classify images based on the Lab color space.
//1.a Assign the parameters for SVM training here
double nu = 0.999D;
double gamma = 0.4D;
double epsilon = 0.01D;
double coef0 = 0;
//kernel types are Linear(0), Poly(1), RBF(2), Sigmoid(3), Chi2(4), Intersection(5)
//For Poly(1) set degree and gamma
double degree = 2;
int kernel_type = 4;
//1.b Create an SVM object
SVM B_channel_svm = SVM.create();
B_channel_svm.setType(104); // 104 corresponds to SVM.NU_SVR
B_channel_svm.setNu(nu);
B_channel_svm.setCoef0(coef0);
B_channel_svm.setKernel(kernel_type);
B_channel_svm.setDegree(degree);
B_channel_svm.setGamma(gamma);
B_channel_svm.setTermCriteria(new TermCriteria(2, 10, epsilon));
// Repeat Step 1.b for the number of SVMs.
//2. Train the SVM
// Note: training_data - if your image has n rows and m columns, you have to make a matrix of size (n*m, o), where o is the number of labels.
// Note: Label_data is the same as above: n rows and m columns, so make a matrix of size (n*m, o), where o is the number of labels.
// Note: Very important - train the SVM with the entire data as training input and a specific column of Label_data as the label. Here I train the data using the B, G and R channels, hence the name B_channel_SVM. I make 3 different SVM objects separately, but you could also do this by creating only one object.
B_channel_svm.train(training_data, Ml.ROW_SAMPLE, Label_data.col(0));
G_channel_svm.train(training_data, Ml.ROW_SAMPLE, Label_data.col(1));
R_channel_svm.train(training_data, Ml.ROW_SAMPLE, Label_data.col(2));
// Now, after training, we "predict" the outcome for a sample using the trained SVM. But first, let's prepare the test data.
// As above for the training data, make a matrix of (n*m, o) and use the columns to predict. So, since I created 3 different SVMs, I will input three separate matrices for the three SVMs of size (n*m, 1).
//3. Predict the testing data outcome using the trained SVM.
B_channel_svm.predict(scene_ml_input, predicted_final_B, StatModel.RAW_OUTPUT);
G_channel_svm.predict(scene_ml_input, predicted_final_G, StatModel.RAW_OUTPUT);
R_channel_svm.predict(scene_ml_input, predicted_final_R, StatModel.RAW_OUTPUT);
//4. Here, predicted_final_ are matrices which gives you the final value as in Label(0,1,2... etc) for the input data (edge profile in your case)
Now, I hope you have an idea of how SVM works. You basically need to do these steps:
Step 1: Identify labels - In your case Gestures from edge profile.
Step 2: Assign values to the labels - For example, if you are trying to classify haptic gestures - Open Hand = 1, Closed Hand/Fist = 2, Thumbs up = 3 and so on.
Step 3: Prepare the training data (edge profiles) and Labels (1,2,3) etc. according to the process above.
Step 4: Prepare data for prediction using the transformation calculated using SVM.
Very important for SVM on OpenCV: normalize your data, and make sure all your matrices are of the same type (CvType).
Hope it helps. Feel free to ask questions if you have any doubts and post what you have tried. I can solve the problem for you if you send me some images but then you won't learn anything right? ;)
Good day,
I have been working through Baddeley et al. 2015 to fit a point process model to several point patterns using mppm {spatstat}.
My point patterns are annual count data of large herbivores (i.e. point localities (x, y) of male/female animals * 5 years) in a protected area (owin). I have a number of spatial covariates e.g. distance to rivers (rivD) and vegetation productivity (NDVI).
Originally I fitted a model where the herbivore response was a function of rivD + NDVI and allowed the coefficients to vary by sex (see mppm1 in the reproducible example below). However, my annual point patterns are not independent between years in that there is a temporally increasing trend (i.e. there are exponentially more animals in year 5 compared to year 1).
So I added year as a random effect, thinking that if I allowed the intercept to change per year I could account for this (see mppm2).
Now I'm wondering if this is the right way to go about it. If I were fitting a GAMM with gamm {mgcv}, I would add a temporal correlation structure, e.g. correlation = corAR1(form=~year), but I don't think this is possible in mppm (see mppm3)?
I would really appreciate any ideas on how to deal with this temporal correlation structure in a replicated point pattern with mppm {spatstat}.
Thank you very much
Sandra
# R version 3.3.1 (64-bit)
library(spatstat) # spatstat version 1.45-2.008
#### Simulate point patterns
# multitype Neyman-Scott process (each cluster is a multitype process)
nclust2 = function(x0, y0, radius, n, types=factor(c("male", "female"))) {
  X = runifdisc(n, radius, centre=c(x0, y0))
  M = sample(types, n, replace=TRUE)
  marks(X) = M
  return(X)
}
year1 = rNeymanScott(5,0.1,nclust2, radius=0.1, n=5)
# plot(year1)
#-------------------
year2 = rNeymanScott(10,0.1,nclust2, radius=0.1, n=5)
# plot(year2)
#-------------------
year2 = rNeymanScott(15,0.1,nclust2, radius=0.1, n=10)
# plot(year2)
#-------------------
year3 = rNeymanScott(20,0.1,nclust2, radius=0.1, n=10)
# plot(year3)
#-------------------
year4 = rNeymanScott(25,0.1,nclust2, radius=0.1, n=15)
# plot(year4)
#-------------------
year5 = rNeymanScott(30,0.1,nclust2, radius=0.1, n=15)
# plot(year5)
#### Simulate distance to rivers
line <- psp(runif(10), runif(10), runif(10), runif(10), window=owin())
# plot(line)
# plot(year1, add=TRUE)
#------------------------ UPDATE ------------------------#
#### Create hyperframe
#---> NDVI simulated with distmap to point patterns (not ideal but just to test)
hyp.years = hyperframe(year=factor(2010:2014),
                       ppp=list(year1, year2, year3, year4, year5),
                       NDVI=list(distmap(year5), distmap(year1), distmap(year2), distmap(year3), distmap(year4)),
                       rivD=distmap(line),
                       stringsAsFactors=TRUE)
hyp.years$numYear = with(hyp.years,as.numeric(year)-1)
hyp.years
#### Run mppm models
# mppm1 = mppm(ppp~(NDVI+rivD)/marks,data=hyp.years); summary(mppm1)
#..........................
# mppm2 = mppm(ppp~(NDVI+rivD)/marks,random = ~1|year,data=hyp.years); summary(mppm2)
#..........................
# correlation = corAR1(form=~year)
# mppm3 = mppm(ppp~(NDVI+rivD)/marks,correlation = corAR1(form=~year),use.gam = TRUE,data=hyp.years); summary(mppm3)
###---> Run mppm model with annual trend and random variation in growth
mppmCorr = mppm(ppp~(NDVI+rivD+numYear)/marks,random = ~1|year,data=hyp.years)
summary(mppmCorr)
If there's a trend in population size over time, then it might make sense to include this trend in the systematic part of the model. I would suggest you add a new numeric variable NumYear to the data frame (e.g. giving the number of years since 2010). Then try adding simple trend terms such as +NumYear to the model formula (this would correspond to the exponential growth in population that you observed). You can keep the 1|year random effect term, which will then allow for random variation in population size around the long-term growth trend.
There's no need to split the data patterns for each year into separate male and female patterns. The variable marks in the model formula can be used to specify any model that depends on sex.
I'm pretty sure that mppm with use.gam=TRUE does not recognise the argument correlation, so it is probably just ignored. (It depends on what happens inside gam.)