Finding the second peak in data containing two Gaussian (normal) peaks

I am fitting 2 Gaussians to my data. The fit requires guessed values of the peak centers as start parameters. I managed to get the first maximum using the following (the data are given by the arrays 'x' and 'y'):
maxd = max(y)  # peak height
center1 = x[y.argmax()]  # guess of center 1
I am now trying to get a guess for the second peak (center 2) to feed into my fit. Any idea how to do that?
The thing is that I could do it one by one by eye, but I am fitting in batch over several datasets, so I need an automated way.
Thanks!
Rajeev
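One way to automate the guess for the second center is to mask out a neighbourhood around the first peak and take the next maximum. This is only a sketch, not part of the original question; min_sep is a made-up separation threshold you would tune to your data (scipy.signal.find_peaks would be another option):
import numpy as np

# x, y are the data arrays from the question
i1 = np.argmax(y)                  # index of the first (highest) peak
center1 = x[i1]                    # guess of center 1

min_sep = 20                       # assumed minimum separation between peaks, in samples
mask = np.abs(np.arange(len(y)) - i1) > min_sep
i2 = np.argmax(np.where(mask, y, -np.inf))   # highest point away from the first peak
center2 = x[i2]                    # guess of center 2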

Related

Identify short signal peaks with rolling time frame in python

I am trying to identify all peaks in my sensor readings. The smallest peak can be less than 10 in amplitude and the largest more than 400. The rolling time window is not fixed, as one peak can arrive within 6 hours and the next within another 3 hours. I tried wavelet transforms and Python peak identification, but they only work for the higher peaks. How do I resolve this? Here is the signal image link: the peaks I am identifying are in grey, and my algorithm's output is in blue.
Welcome to SO.
It is hard to provide you with a detailed answer without knowing your data's sampling rate and the duration of the peaks. From what I see in your example image they seem all over the place!
I don't think that wavelets will be of any use for your problem.
A recipe that I like to use to despike data is:
Smooth your input data using a median filter (an 11-point median filter generally does the trick for me): smoothed = scipy.signal.medfilt(data, kernel_size=11)
Compute a noise array by subtracting smoothed from data: noise=data-smoothed
Create a despiked_data array from data:
despiked_data=np.zeros_like(data)
np.copyto(despiked_data, data)
Then every time the noise exceeds a user defined threshold (mythreshold), replace the corresponding value in despiked_data with nan values: despiked_data[np.abs(noise)>mythreshold]=np.nan
You may later interpolate the output despiked_data array but if your intent is simply to identify the spikes, you don't even need to run this extra step.
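Putting the recipe together as one function, for reference (a sketch; the 11-point kernel and mythreshold are placeholders to tune for your noise level):
import numpy as np
import scipy.signal

def despike(data, kernel_size=11, mythreshold=50.0):
    # median-smooth, estimate the noise, and blank out samples whose
    # deviation from the smoothed trace exceeds the threshold
    smoothed = scipy.signal.medfilt(data, kernel_size=kernel_size)
    noise = data - smoothed
    despiked_data = np.asarray(data, dtype=float).copy()
    despiked_data[np.abs(noise) > mythreshold] = np.nan
    return despiked_data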

How to do clustering on a set of paraboloids and on a set of planes?

I am performing cluster analysis in two parts: in the first part (a) I cluster a set of paraboloids, and in the second part (b) a set of planes. The parts are separate, but in both I initially had one set of images; on every image I detected points to which I fit (a) a paraboloid and (b) a plane. I obtained the equations of the surfaces, so I now have 2 sets of data: for (a) an array of arrays of size 6 (the 6 coefficients of the paraboloid equation) and for (b) an array of arrays of size 3 (the 3 coefficients of the plane equation).
I want to cluster both groups based on the similarities of (a) paraboloids and (b) planes. I am not sure which features of the surfaces (paraboloids and planes) are suitable for clustering.
For (b) I have tried using the angle between the fitted plane and the plane z = 0 -- so only 1 feature for every object in the sample.
I have also tried simply treating these 3 (or 6) coefficients as separate variables, but I believe that this way I am not using the fact that the coefficients are connected to each other.
I would be really grateful to hear whether there is a better approach to choosing features beyond merely the set of coefficients. Also, I am performing hierarchical and agglomerative clustering.
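For reference, the angle feature mentioned above could be computed like this. This is only a sketch and assumes the planes were fitted in the form z = a*x + b*y + c (the question does not state the parametrisation):
import numpy as np

def tilt_angle(a, b, c):
    # angle in degrees between the plane z = a*x + b*y + c and the plane z = 0,
    # taken as the angle between their normals (a, b, -1) and (0, 0, 1)
    normal = np.array([a, b, -1.0])
    cos_theta = abs(normal[2]) / np.linalg.norm(normal)
    return np.degrees(np.arccos(cos_theta))

# one scalar feature per plane, e.g. as input to hierarchical clustering
planes = np.array([[0.1, -0.2, 3.0], [1.5, 0.4, -1.0]])   # toy coefficients
features = np.array([tilt_angle(a, b, c) for a, b, c in planes])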

Moving window with Gaussian shape over time series data in Python

I recently started with Python, Pandas, NumPy, etc. for my master's thesis. It concerns time series (sensor) data to be fed into an LSTM (multi-class classification problem). The current dataset is a dataframe with 32 features (columns).
My goal:
I want to verify the influence of moving windows and other methods on the LSTM's model performance. For this I already implemented a moving window method (median) with a size of 10 and exponential smoothing on the whole trace:
#Window size: 10
df_Median = df.rolling(10).median()
#application on whole trace
df_EWM_mean = df.ewm(alpha=0.2).mean()
My challenge:
The third method should be a moving window with a gaussian shape (Gauss kernel). I implemented it according to the Pandas documentation:
Pandas.DataFrame.rolling
for header, content in df.iteritems():
    stdDev = np.std(content)
    if stdDev != 0:
        win_len = 10
        df_Gauss = df.loc[:, header].rolling(win_len, win_type='gaussian').mean(std=stdDev)
This actually works. But the standard deviation argument stdDev must be calculated from the 10 data points currently inside the window, not from the whole trace... So I'd have to select the 10 current data points, calculate their stdDev, hand it over to the function, shift the window 10 points to the right, and start from the beginning...
How could this be accomplished? I'm quite new to Python, Pandas etc.
So I'd be very grateful for hints.
Many thanks in advance.
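For what it's worth, one way to recompute the standard deviation from the points inside each window is a rolling apply with a custom function. This is only a sketch, not part of the original post; it keeps the step-1 sliding window that rolling uses and assumes df is the 32-feature dataframe from the question:
import numpy as np
from scipy.signal import windows

def gauss_window_mean(values):
    # Gaussian-weighted mean of one window, using that window's own std
    std = values.std()
    if std == 0:
        return values.mean()
    weights = windows.gaussian(len(values), std)
    return np.average(values, weights=weights)

win_len = 10
# apply() runs column by column; raw=True passes plain NumPy arrays
df_Gauss = df.rolling(win_len).apply(gauss_window_mean, raw=True)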

Using scipy.stats.entropy on gmm.predict_proba() values

Background so I don't throw out an XY problem -- I'm trying to check the goodness of fit of a GMM because I want statistical back-up for why I'm choosing the number of clusters I've chosen to group these samples. I'm checking AIC, BIC, entropy, and root mean squared error. This question is about entropy.
I've used kmeans to cluster a bunch of samples, and I want an entropy greater than 0.9 (stats and psychology are not my expertise and this problem is both). I have 59 samples; each sample has 3 features in it. I look for the best covariance type via
from sklearn import mixture

for cv_type in cv_types:
    for n_components in n_components_range:
        # Fit a Gaussian mixture with EM
        gmm = mixture.GaussianMixture(n_components=n_components,
                                      covariance_type=cv_type)
        gmm.fit(data3)
where the n_components_range is just [2] (later I'll check 2 through 5).
Then, of the four, I take the GMM with the lowest AIC or BIC, saved as best_eitherAB (not shown). I want to see whether the label assignments of the predictions are stable across time (I want to run 1000 iterations), so I know I then need to calculate the entropy, which needs the class assignment probabilities. So I predict the probabilities of the class assignments via gmm's method,
probabilities = best_eitherAB.predict_proba(data3)
all_probabilities.append(probabilities)
After all the iterations, I have an array of 1000 arrays, each containing 59 rows (the sample size) by 2 columns (one per class). Each row of two values sums to 1, forming a probability distribution.
Now, I'm not entirely sure what to do regarding the entropy. I can just feed the whole thing into scipy.stats.entropy,
entr = scipy.stats.entropy(all_probabilities)
and it spits out numbers: for each of my samples I get a 2-item numpy array. I could feed in just one of the 1000 tests and get a single small array of two items, or I could feed in just a single column and get a single value back. But I don't know what these numbers mean, and they are between 1 and 3.
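For reference, a small sketch (my own illustration, not from the question) of how scipy.stats.entropy treats a 2-D array of per-sample class probabilities:
import numpy as np
from scipy.stats import entropy

# stand-in for one run's predict_proba output: shape (59, 2), rows sum to 1
rng = np.random.default_rng(0)
raw = rng.random((59, 2))
probabilities = raw / raw.sum(axis=1, keepdims=True)

# default behaviour: each *column* is treated as a distribution over the
# 59 samples, so this returns 2 numbers (one per class column)
per_column = entropy(probabilities)

# per-sample class-assignment entropy instead: transpose first, giving
# one value per sample (in nats, at most ln 2 for two classes)
per_sample = entropy(probabilities.T)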
So my questions are -- am I totally misunderstanding how I can use scipy.stats.entropy to calculate the stability of my classes? If I'm not, what's the best way to find a single number entropy that tells me how good my model selection is?

How to reduce data of unknown size into fixed-size data? Please read details

Example:
Given n images, numbered 1 to n where n is unknown, I can calculate a property of every image, which is a scalar quantity. Now I have to represent this property across all images in a fixed-size vector (of size, say, 5 or 10).
One naive approach could be the vector [avg, max, min, std_deviation].
I would also like to include the effect of the relative positions of those images.
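For concreteness, the naive summary vector could be built like this (only a sketch; using the normalised position of the maximum is my own assumption for one simple way to encode relative position):
import numpy as np

def summarize(values):
    # collapse a variable-length list of per-image scalars into a length-5 vector
    v = np.asarray(values, dtype=float)
    return np.array([
        v.mean(),                          # avg
        v.max(),                           # max
        v.min(),                           # min
        v.std(),                           # std_deviation
        v.argmax() / max(len(v) - 1, 1),   # relative position of the peak (0..1)
    ])

# 7 images or 12 images both map to a length-5 vector
print(summarize([0.2, 0.5, 0.9, 0.4, 0.3, 0.1, 0.2]))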
What you are looking for is called feature extraction. There are many techniques for this with images; for your purpose, try:
PCA
Auto-encoders
Convolutional Auto-encoders
You could also look into conventional (older) methods like SIFT, HOG, and edge detection, but they will all need an extra step to reduce their output to a smaller, fixed size.
