I have been looking at clustering infrared spectroscopy data with the sklearn clustering methods. I am having trouble getting the clustering to work with the data, since I'm new to this I don't know if the way I'm coding it is wrong or my approach is wrong.
My data, in Pandas DataFrame format, looks like this:
Index Wavenumbers (cm-1) %Transmission_i ...
0 650 100 ...
. . . ...
. . . ...
. . . ...
n 4000 95 ...
where, the x-axis for all spectra is the Wavenumbers (cm-1) column and the subsequent columns (%Transmission_i) are the actual data. I want to cluster these columns (in terms of which spectra are most similar to each other), as such I am trying this code:
X = np.array([list(df[x].values) for x in df.set_index(x)])
clusters = DBSCAN().fit(X)
where df is my DataFrame, and np is numpy (hopefully obvious). The problem is when I print out the cluster labels it just spits out nothing but -1 which means all my data is noise. This isn't the case, when I plot my data I can clearly see a some spectra look very similar (as they should).
How can I get the similar spectra to be clustered properly?
EDIT:
Here is a minimum working example.
import numpy as np
import pandas as pd
import sklearn as sk
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
x = 'x-vals'
def cluster_data(df):
avg_list = []
dif_list = []
for col in df:
if x == col:
continue
avg_list.append(np.mean(df[col].values))
dif_list.append(np.mean(np.diff(df[col].values)))
a = sk.preprocessing.normalize([avg_list], norm='max')[0]
b = sk.preprocessing.normalize([dif_list], norm='max')[0]
X = []
for i,j in zip(a,b):
X.append([i,j])
X = np.array(X)
clusters = DBSCAN(eps=0.2).fit(X)
return clusters.labels_
def plot_clusters(df, clusters):
colors = ['red', 'green', 'blue', 'black', 'pink']
i = 0
for col in df:
if col == x:
continue
color = colors[clusters[i]]
plt.plot(df[x], df[col], color=color)
i +=1
plt.show()
x1 = np.linspace(-np.pi, np.pi, 201)
y1 = np.sin(x1) + 1
y2 = np.cos(x1) + 1
y3 = np.zeros_like(x1) + 2
y4 = np.zeros_like(x1) + 1.9
y5 = np.zeros_like(x1) + 1.8
y6 = np.zeros_like(x1) + 1.7
y7 = np.zeros_like(x1) + 1
y8 = np.zeros_like(x1) + 0.9
y9 = np.zeros_like(x1) + 0.8
y10 = np.zeros_like(x1) + 0.7
df = pd.DataFrame({'x-vals':x1, 'y1':y1, 'y2':y2, 'y3':y3, 'y4':y4,
'y5':y5, 'y6':y6, 'y7':y7, 'y8':y8, 'y9':y9,
'y10':y10})
clusters = cluster_data(df)
plot_clusters(df, clusters)
This produces the following plot, where red is a cluster and pink is noise.
I was able to get a method working, but I'm not fully convinced this is the best method for clustering IR spectra.
First I run through all the spectra and compile a list of the mean and mean of the first derivative of each spectra. The mean is supposed to be representative of the vertical location of the spectra, while the mean of the first derivative is supposed to be representative of the shape of the spectra.
avg_list = []
dif_list = []
for col in df:
if x == col:
continue
avg_list.append(np.mean(df[col].values))
dif_list.append(np.mean(np.dif(df[col].values)))
Then I normalize each list, this is so I can pick a eps value based on percent changes.
a = sk.preprocessing.normalize([avg_list], norm='max')[0]
b = sk.preprocessing.normalize([diff_list], norm='max')[0]
After that I make a 2D array for runnning DBSCAN in 2D mode.
X = []
for i,j in zip(a,b):
X.append([i,j])
Then I run the DBSCAN clustering method with an arbitrary percent difference value for the eps parameter.
X = np.array(X)
clusters = DBSCAN(eps=0.2).fit(X)
Then clusters.labels_ returns an array with the length of the number of spectra in my DataFrame. It works fairly well, but it is rather exclusive and the clusters could be better. Some more fine tuning would be helpful.
First, transpose your dataframe, so that you have the datapoints as rows as is the standard. It should look like this:
Index 650 660 ... 4000
0 100 98 ... 95
1 . . ... .
. . . ... .
n . . ... .
Then you get your X for the clustering like that:
X = df.values
Next, you cluster:
from sklearn.cluster import DBSCAN
cluster = DBSCAN().fit(X)
print(cluster.labels_)
As a recommendation for spectral data, kmeans (disadvantage: you need to set the number of clusters beforehand) and self-organizing maps (disadvantage: soft clusters instead of hard clusters) work quite well. For example, you find an example here for clustering on hyperspectral data.
Related
I am using python3, Scipy
I have a 3d points (x,y,z]
From them I make s apline using scipy.interpolate.splprep
x_points = np.linspace(0, 2*np.pi, 10)
y_points = np.sin(x_points)
z_points = np.cos(x_points)
path = np.vstack([x_points, y_points, z_points])
tck, u = sc.splprep(path, k=3, s=0)
I wish to get the coefficients of the spline[i]:
For example the latest splins:
sp9 = a9 + b9(x-x4) + c9(x-x4)^2 + d9(x-x4)^3
I know that the tck is (t,c,k) a tuple containing the vector of knots, the B-spline coefficients, and the degree of the spline.
But I don't see how I can get this spline function and plot only it
I tried using this method:
import numpy as np
import scipy.interpolate as sc
x_points = np.linspace(0, 2*np.pi, 10)
y_points = np.sin(x_points)
z_points = np.cos(x_points)
path = np.vstack([x_points, y_points, z_points])
tck, u = sc.splprep(path, k=3, s=0)
p = sc.PPoly.from_spline(tck)
but I'm getting this error on the last line:
p = sc.PPoly.from_spline(tck) File
"C:\Users...\Python38\lib\site-packages\scipy\interpolate\interpolate.py",
line 1314, in from_spline cvals = np.empty((k + 1, len(t)-1),
dtype=c.dtype)
AttributeError: 'list' object has no attribute 'dtype'
The coefficients in the tck tuple are in the b-spline basis. If you want to convert them to the power basis, you can do PPoly.from_spline(tck) .
An obligatory note however: converting between bases incurs numerical errors.
EDIT. First, as it's splprep, you'll need to convert the list-of-arrays c into a proper numpy array and transpose (it's a known wart of splPrep). Then, as it turns out, PPoly.from_spline does not handle multidimensional c (this might be a nice pull request to the scipy repository), so you'll need to e.g. loop over the dimensions. Something along the lines of (continuing from your OP)
t, c, k = tck
cc = np.asarray(c) # cc.shape is (3, 10) now
spl0 = sc.PPoly.from_spline((t, cc.T[0], k))
print(spl0.c) # here are your coefficients for the component 0
I want to make a few tools to help to learn and teach basic statistic. One of them aims to help visualise z-score probability table:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
import scipy.stats as st
def draw_z_score(x, cond, mean, std, title,color='b'):
y = st.norm.pdf(x, mean, std)
z = x[cond]
plt.plot(x, y)
plt.ylim(ymin=0)
plt.xlim(xmin=-4.5, xmax=4.5)
plt.fill_between(z, 0, st.norm.pdf(z, mean, std),color=color)
plt.title(title)
plt.tight_layout()
plt.show()
def z_table_probabilty (z_score, z_score2=None, area='l'):
normal = np.arange(-3.9, 3.9, 0.1)
if area == 'l':
Pz = round(st.norm.cdf(z_score), 4)
draw_z_score(normal,normal<z_score,0,1,f'z = {z_score} P(z)={Pz}')
elif area == 'r':
Pz = round(1 - st.norm.cdf(z_score), 4)
draw_z_score(normal,normal>z_score,0,1,f'Z ={z_score} P(1-z)={Pz}',color='r')
elif area == 'tt' and z_score2 != None:
z2 = max(z_score, z_score2)
z = min(z_score, z_score2)
Pz = round(st.norm.cdf(z2) - st.norm.cdf(z), 4)
draw_z_score(normal,(normal<z2)&(normal>z),0, 1, f'z= {z} i z\'= {z2} P(z\'-z)={Pz}', color='y')
Now, when I try:
z_table_probabilty(-0.9)
I have:
z-score=-0.9
Could someone tell my why z-score -0.9 is equal 1 on my plot?, and why the distances between x=4 and end of distribution tail, and x=-4 and end of other tail are different? The whole plot seems to be slightly moved.
What have I done wrong?
Thanks
MV
normal = np.arange(-3.9, 3.9, 0.1) creates an array from -3.9 to 3.8. The end point is not included. Hence you see the curve start at -3.9 and end at 3.8.
With normal<z_score you choose all the points in normal which are smaller than z_score. When z_score=-0.9, those points are from -3.9 to -1.0 because -0.9 is not smaller than -0.9.
In total I would recommend defining normal a bit more dense. This would avoid the two problems. E.g.
normal = np.linspace(-3.9, 3.9, 391)
to create points in steps of 0.02 instead of 0.1.
I have a dataframe called 'games':
Game_id Goals P_value
1 2 0.4
2 3 0.321
45 0 0.64
I need to split the P value to 0.05 steps, bin the rows per P value and than create a line graph that shows the sum per p value.
What I currently have:
games.set_index('p value', inplace=True)
games.sort_index()
np.cumsum(games['goals']).plot()
But I get this:
No matter what I tried, I couldn't group the P values and show the sum of goals per P value..
I also tried to use matplotlib.pyplot but than I couldn't use the cumsum function..
If I understood you correctly, you want to have discrete steps in the p-value of width 0.05 and show the cumulative sum?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# create some random example data
df = pd.DataFrame({
'goals': np.random.poisson(3, size=1000),
'p_value': np.random.uniform(0, 1, size=1000)
})
# define binning in p-value
bin_edges = np.arange(0, 1.025, 0.05)
bin_center = 0.5 * (bin_edges[:-1] + bin_edges[1:])
bin_width = np.diff(bin_edges)
# find the p_value bin, each row belongs to
# 0 is underflow, len(edges) is overflow bin
df['bin'] = np.digitize(df['p_value'], bins=bin_edges)
# get the number of goals per p_value bin
goals_per_bin = df.groupby('bin')['goals'].sum()
print(goals_per_bin)
# not every bin might be filled, so we will use pandas index
# matching t
binned = pd.DataFrame({
'center': bin_center,
'width': bin_width,
'goals': np.zeros(len(bin_center))
}, index=np.arange(1, len(bin_edges)))
binned['goals'] = goals_per_bin
plt.step(
binned['center'],
binned['goals'],
where='mid',
)
plt.xlabel('p-value')
plt.ylabel('goals')
plt.show()
I'm trying to find the find the probability distribution that better fits my data. I've tried with the code I've found in different threads, but the results are not what I'm expecting.
The descriptive statistics and histogram for my data are as follows:
Data Histogram
count 865.000000
mean 43.476713
std 12.486362
min 4.075682
25% 34.934609
50% 41.917304
75% 51.271708
max 88.843940
I tried to find a proper distribution function using the following code, but the results are not what I expected.
size = 865
kappa=99
x = scipy.arange(size)
y = scipy.int_(scipy.round_(st.vonmises.rvs(kappa,size=size)*100))
h = plt.hist(df['spreadMaizChicagoAtlantico'],bins=100,color='b')
dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']
for dist_name in dist_names:
dist = getattr(scipy.stats, dist_name)
param = dist.fit(y)
pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
plt.plot(pdf_fitted, label=dist_name)
plt.xlim(0,100)
plt.legend(loc='upper right')
plt.show()
Data histogram with functions
Can Anyone please tell me what I'm doing wrong and guide me through a better understanding of this solutions.
Thanks to the reply from before I found my mistake.
I got all the values from the DataFrame and made a numpy array.
ser=df.values
Then I ran a similar code from before correcting the fitting of the distribution to the proper data
size = 867
x = scipy.arange(size)
y = scipy.int_(scipy.round_(scipy.stats.vonmises.rvs(5,size=size)*60))
h = plt.hist(ser, bins=range(80))
dist_names = ['beta', 'rayleigh', 'norm']
for dist_name in dist_names:
dist = getattr(scipy.stats, dist_name)
param = dist.fit(ser)
pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
plt.plot(pdf_fitted, label=dist_name)
plt.xlim(0,100)
plt.legend(loc='upper right')
plt.show()
The result is as follows, showing the histogram and three probability density functions.
The distfit library can do this job as it searches for the best fit among 89 theoretical distributions.
pip install distfit
import numpy as np
from distfit import distfit
# Example data
X = np.random.normal(10, 3, 2000)
# Initialize
dfit = distfit()
# Search for best theoretical fit on your empirical data
dfit.fit_transform(X)
# The plot function will now also include the predictions of y
dfit.plot(chart='PDF',
emp_properties={'linewidth': 4, 'color': 'k'},
bar_properties={'edgecolor':'k', 'color':'g'},
pdf_properties={'linewidth': 4, 'color': 'r'})
I have reviewed the response to this question: How would I iterate over a list of files and plot them as subplots on a single figure?
But am none the wiser on how to achieve my goal. I would like to plot multiple data sets, with differing x axes, onto a single figure in Python. I have included a snippet of my code below, which performs an FFT on a dataset, then calculates 3 Butterworth filter outputs. Ideally I would like to have all plotted on a single figure, which I have attempted to achieve in the code below.
The for loop calculates the 3 Butterworth filter outputs, the code above - the FFT and the code directly below attempts to append the FFT curve and sqrt(0.5) line to the previously generated plots for display.
Any Direction or advice would be appreciated.
"""Performs a Fast Fourier Transform on the data specified at the base of the code"""
def FFT(col):
x = io2.loc[1:,'Time']
y = io2.loc[1:,col]
# Number of samplepoints
#N = 600
N = pd.Series.count(x)
N2 = int(N/2)
# sample spacing
#T = 1.0 / 800.0
T = 1/(io2.loc[2,'Time'] - io2.loc[1,'Time'])
#x = np.linspace(0.0, N*T, N)
#y = np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)
yf = scipy.fftpack.fft(y)
xf = np.linspace(0.0, 1.0/(2.0*T), N2)
fig=plt.figure()
plt.clf()
i=1
for order in [3, 6, 9]:
ax=fig.add_subplot(111, label="order = %d" % order)
b, a = butter_lowpass(cutoff, fs, order=order)
w, h = freqz(b, a, worN=2000)
ax.plot((fs * 0.5 / np.pi) * w, abs(h))
i=i+1
ax4=fig.add_subplot(111, label='sqrt(0.5)', frame_on=False)
ax5=fig.add_subplot(111, label="FFT of "+col, frame_on=False)
ax4.plot([0, 0.5 * fs], [np.sqrt(0.5), np.sqrt(0.5)], '--')
ax5.plot(xf, 2.0/N * np.abs(yf[:N2]))
plt.xlabel('Frequency (Hz)')
plt.ylabel('Gain')
plt.grid(True)
plt.legend(loc='best')
#fig, ax = plt.subplots()
#ax.plot(xf, 2.0/N * np.abs(yf[:N2]), label="FFT of "+col)
plt.axis([0,5000,0,0.1])
#plt.xlabel('Frequency (Hz)')
#plt.ylabel('Amplitude (mm)')
#plt.legend(loc=0)
plt.show()
return
Kind Regards,
Here you can find a minimal example of how to plot multiple lines with different x and y datasets. You are recreating the plot every time you type add_subplot(111). Instead, you should call plot multiple times. I have added an example for a single plot with multiple lines, as well as an example for one subplot per line.
import numpy as np
import matplotlib.pyplot as plt
x1 = np.arange(0, 10, 1)
x2 = np.arange(3, 12, 0.1)
x3 = np.arange(2, 8, 0.01)
y1 = np.sin(x1)
y2 = np.cos(x2**0.8)
y3 = np.sin(4.*x3)**3
data = []
data.append((x1, y1, 'label1'))
data.append((x2, y2, 'label2'))
data.append((x3, y3, 'label3'))
# All lines in one plot.
plt.figure()
for n in data:
plt.plot(n[0], n[1], label=n[2])
plt.legend(loc=0, frameon=False)
# One subplot per data set.
cols = 2
rows = len(data)//2 + len(data)%2
plt.figure()
gs = plt.GridSpec(rows, cols)
for n in range(len(data)):
i = n%2
j = n//2
plt.subplot(gs[j,i])
plt.plot(data[n][0], data[n][1])
plt.title(data[n][2])
plt.tight_layout()
plt.show()