How to delete the outliers - python-3.x

I managed to apply the interquartile range rule correctly, but when I display the box-and-whisker plot of the dataset without outliers, I can see that there are still outliers. What is wrong?
Here is the code:
# Load libraries
import pandas as pd
from pandas import read_csv, set_option
from matplotlib import pyplot as plt

# Load dataset
filename = "/home/fogang/dataset/Regression/Housing Boston/housing.csv"
df = read_csv(filename, header=0)
df = df.drop('Unnamed: 0', axis=1)  # Let's delete the column 'Unnamed: 0'
one_dim = pd.DataFrame()
one_dim['rm'] = df['rm']

# Shape of dataset
print(one_dim.shape)

# Peek at dataset
print(one_dim.head(10))

# Let's look whether there are NaN values
print(one_dim.isnull().sum())

# Box and whisker plot
one_dim.plot(kind='box', subplots=True, layout=(1, 1), sharex=False, sharey=False, fontsize=12)
plt.show()

# Describe dataset
print(one_dim.describe())

# Let's find the inter-quartile range
unidim = one_dim['rm']
unidim_Q1 = unidim.quantile(0.25)
unidim_Q3 = unidim.quantile(0.75)
unidim_IQR = unidim_Q3 - unidim_Q1
unidim_lower = unidim_Q1 - (1.5 * unidim_IQR)
unidim_upper = unidim_Q3 + (1.5 * unidim_IQR)

# Outliers
unidim_outliers = pd.DataFrame()
unidim_outliers['outliers'] = unidim[(unidim < unidim_lower) | (unidim > unidim_upper)]
unidim_outliers.info()

# Good data
unidim_good = pd.DataFrame()
unidim_good['good'] = unidim[(unidim >= unidim_lower) & (unidim <= unidim_upper)]
unidim_good.info()
unidim_good.plot(kind='box', subplots=True, layout=(1, 2), sharex=False, sharey=False, fontsize=12)
plt.show()
What should I do?

You have outliers spread widely in both tails, low and high. So when you cut some of them out and check again, the remaining data has new outliers of its own.
If you want to get rid of all the outliers in a single cut, you can use a stricter rule, for example:
unidim_lower = unidim_Q1 - (1.3 * unidim_IQR);
unidim_upper = unidim_Q3 + (1.3 * unidim_IQR);
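If you prefer to keep the standard 1.5 * IQR rule instead, another option is to apply it repeatedly until a pass removes nothing. A minimal sketch, assuming the unidim Series from the question:
# Re-apply the 1.5 * IQR filter until no new outliers are found.
trimmed = unidim.copy()
while True:
    q1, q3 = trimmed.quantile(0.25), trimmed.quantile(0.75)
    iqr = q3 - q1
    mask = (trimmed >= q1 - 1.5 * iqr) & (trimmed <= q3 + 1.5 * iqr)
    if mask.all():
        break
    trimmed = trimmed[mask]
print(trimmed.shape)
Be aware that this can shave off a fair amount of data, which brings us to the next point.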
But I should warn you: not all 'outliers' are bad for the model. You should choose wisely what to treat as an 'outlier' and what is actually useful data.

Related

How to Cluster Infrared Spectroscopy Data with Python

I have been looking at clustering infrared spectroscopy data with the sklearn clustering methods. I am having trouble getting the clustering to work with the data; since I'm new to this, I don't know whether my code or my approach is wrong.
My data, in Pandas DataFrame format, looks like this:
Index   Wavenumbers (cm-1)   %Transmission_i   ...
0       650                  100               ...
.       .                    .                 ...
.       .                    .                 ...
.       .                    .                 ...
n       4000                 95                ...
where the x-axis for all spectra is the Wavenumbers (cm-1) column and the subsequent columns (%Transmission_i) are the actual data. I want to cluster these columns (in terms of which spectra are most similar to each other), so I am trying this code:
X = np.array([list(df[x].values) for x in df.set_index(x)])
clusters = DBSCAN().fit(X)
where df is my DataFrame, and np is numpy (hopefully obvious). The problem is that when I print out the cluster labels it just spits out nothing but -1, which means all my data is noise. This isn't the case: when I plot my data I can clearly see that some spectra look very similar (as they should).
How can I get the similar spectra to be clustered properly?
EDIT:
Here is a minimum working example.
import numpy as np
import pandas as pd
import sklearn as sk
import sklearn.preprocessing  # needed so sk.preprocessing resolves below
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
x = 'x-vals'
def cluster_data(df):
    avg_list = []
    dif_list = []
    for col in df:
        if x == col:
            continue
        avg_list.append(np.mean(df[col].values))
        dif_list.append(np.mean(np.diff(df[col].values)))
    a = sk.preprocessing.normalize([avg_list], norm='max')[0]
    b = sk.preprocessing.normalize([dif_list], norm='max')[0]
    X = []
    for i, j in zip(a, b):
        X.append([i, j])
    X = np.array(X)
    clusters = DBSCAN(eps=0.2).fit(X)
    return clusters.labels_

def plot_clusters(df, clusters):
    colors = ['red', 'green', 'blue', 'black', 'pink']
    i = 0
    for col in df:
        if col == x:
            continue
        color = colors[clusters[i]]
        plt.plot(df[x], df[col], color=color)
        i += 1
    plt.show()
x1 = np.linspace(-np.pi, np.pi, 201)
y1 = np.sin(x1) + 1
y2 = np.cos(x1) + 1
y3 = np.zeros_like(x1) + 2
y4 = np.zeros_like(x1) + 1.9
y5 = np.zeros_like(x1) + 1.8
y6 = np.zeros_like(x1) + 1.7
y7 = np.zeros_like(x1) + 1
y8 = np.zeros_like(x1) + 0.9
y9 = np.zeros_like(x1) + 0.8
y10 = np.zeros_like(x1) + 0.7
df = pd.DataFrame({'x-vals': x1, 'y1': y1, 'y2': y2, 'y3': y3, 'y4': y4,
                   'y5': y5, 'y6': y6, 'y7': y7, 'y8': y8, 'y9': y9,
                   'y10': y10})
clusters = cluster_data(df)
plot_clusters(df, clusters)
This produces the following plot, where red is a cluster and pink is noise.
I was able to get a method working, but I'm not fully convinced it is the best method for clustering IR spectra.
First I run through all the spectra and compile a list of the mean and the mean of the first derivative of each spectrum. The mean is supposed to represent the vertical location of a spectrum, while the mean of the first derivative is supposed to represent its shape.
avg_list = []
dif_list = []
for col in df:
    if x == col:
        continue
    avg_list.append(np.mean(df[col].values))
    dif_list.append(np.mean(np.diff(df[col].values)))
Then I normalize each list so that I can pick an eps value based on percent changes.
a = sk.preprocessing.normalize([avg_list], norm='max')[0]
b = sk.preprocessing.normalize([dif_list], norm='max')[0]
After that I build a 2D array for running DBSCAN in 2D mode.
X = []
for i, j in zip(a, b):
    X.append([i, j])
Then I run the DBSCAN clustering method with an arbitrary percent difference value for the eps parameter.
X = np.array(X)
clusters = DBSCAN(eps=0.2).fit(X)
Then clusters.labels_ returns an array with one label per spectrum in my DataFrame. It works fairly well, but it is rather exclusive and the clusters could be better; some more fine-tuning would be helpful.
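One hedged way to do that fine-tuning is the usual k-distance heuristic for choosing eps: plot each point's distance to its k-th nearest neighbour and look for the elbow. A sketch, assuming the 2D feature array X built above:
from sklearn.neighbors import NearestNeighbors

# Distance of every point to its k-th nearest neighbour (k roughly = min_samples).
k = 4
nbrs = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nbrs.kneighbors(X)
kth_dist = np.sort(distances[:, -1])

# The elbow of this curve is a reasonable starting value for eps.
plt.plot(kth_dist)
plt.xlabel('points sorted by k-th neighbour distance')
plt.ylabel('k-th neighbour distance')
plt.show()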
First, transpose your dataframe so that the data points are rows, as is the standard. It should look like this:
Index   650   660   ...   4000
0       100   98    ...   95
1       .     .     ...   .
.       .     .     ...   .
n       .     .     ...   .
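A minimal sketch of that transposition, assuming the wavenumber column is literally named 'Wavenumbers (cm-1)':
# Wavenumbers become the columns, each spectrum becomes one row.
df = df.set_index('Wavenumbers (cm-1)').T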
Then you get your X for the clustering like this:
X = df.values
Next, you cluster:
from sklearn.cluster import DBSCAN
cluster = DBSCAN().fit(X)
print(cluster.labels_)
As a recommendation for spectral data, k-means (disadvantage: you need to set the number of clusters beforehand) and self-organizing maps (disadvantage: soft clusters instead of hard clusters) work quite well. You can find an example of clustering hyperspectral data here.
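For reference, a minimal k-means sketch on the transposed data (the number of clusters is an assumption you have to supply; 3 is purely illustrative):
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)
print(kmeans.labels_)  # one cluster label per spectrum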

Multivariate binary sequence prediction with CRF

This question is an extension of this one, which focuses on LSTM as opposed to CRF. Unfortunately, I do not have any experience with CRFs, which is why I'm asking these questions.
Problem:
I would like to predict a sequence of binary signals for multiple, non-independent groups. My dataset is moderately small (~1000 records per group), so I would like to try a CRF model here.
Available data:
I have a dataset with the following variables:
Timestamps
Group
Binary signal representing activity
Using this dataset I would like to forecast group_a_activity and group_b_activity, which are both 0 or 1.
Note that the groups are believed to be cross-correlated and additional features can be extracted from timestamps -- for simplicity we can assume that there is only 1 feature we extract from the timestamps.
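For illustration only, here is a hedged sketch of deriving a single feature from the timestamps; the column name 'timestamp' and the hour-of-day choice are hypothetical:
import pandas as pd

# Hypothetical raw frame with a timestamp column; hour of day as the single extra feature.
raw = pd.DataFrame({'timestamp': pd.date_range('2019-01-01', periods=6, freq='H')})
raw['hour'] = raw['timestamp'].dt.hour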
What I have so far:
Here is the data setup that you can reproduce on your own machine.
# libraries
import re
import numpy as np
import pandas as pd
data_length = 18 # how long our data series will be
shift_length = 3 # how long of a sequence do we want
df = (pd.DataFrame  # create a sample dataframe
      .from_records(np.random.randint(2, size=[data_length, 3]))
      .rename(columns={0: 'a', 1: 'b', 2: 'extra'}))
df.head() # check it out
# shift (assuming data is sorted already)
colrange = df.columns
shift_range = [_ for _ in range(-shift_length, shift_length+1) if _ != 0]
for c in colrange:
    for s in shift_range:
        if not (c == 'extra' and s > 0):
            charge = 'next' if s > 0 else 'last'  # 'next' variables is what we want to predict
            formatted_s = '{0:02d}'.format(abs(s))
            new_var = '{var}_{charge}_{n}'.format(var=c, charge=charge, n=formatted_s)
            df[new_var] = df[c].shift(s)
# drop unnecessary variables and trim missings generated by the shift operation
df.dropna(axis=0, inplace=True)
df.drop(colrange, axis=1, inplace=True)
df = df.astype(int)
df.head() # check it out
# a_last_03 a_last_02 ... extra_last_02 extra_last_01
# 3 0 1 ... 0 1
# 4 1 0 ... 0 0
# 5 0 1 ... 1 0
# 6 0 0 ... 0 1
# 7 0 0 ... 1 0
# [5 rows x 15 columns]
Before we get to the CRF part, I suspect that I cannot approach this problem from a multi-task learning point of view (predicting patterns for both A and B via one model), and therefore I'm going to have to predict each of them individually.
Now the CRF part. I've found some relevant examples (here is one), but they all tend to predict a single class value based on a prior sequence.
Here is my attempt at using a CRF here:
import pycrfsuite
crf_features = [] # a container for features
crf_labels = [] # a container for response
# lets focus on group A only for this one
current_response = [c for c in df.columns if c.startswith('a_next')]
# predictors are going to have to be nested otherwise I'll run into problems with dimensions
current_predictors = [c for c in df.columns if not 'next' in c]
current_predictors = set([re.sub('_\d+$','',v) for v in current_predictors])
for index, row in df.iterrows():
    # not sure if its an effective way to iterate over a DF...
    iter_features = []
    for p in current_predictors:
        pred_feature = []
        # note that 0/1 values have to be converted into booleans
        for k in range(shift_length):
            iter_pred_feature = p + '_{0:02d}'.format(k+1)
            pred_feature.append(p + "=" + str(bool(row[iter_pred_feature])))
        iter_features.append(pred_feature)
    iter_response = [row[current_response].apply(lambda z: str(bool(z))).tolist()]
    crf_labels.extend(iter_response)
    crf_features.append(iter_features)
trainer = pycrfsuite.Trainer(verbose=True)
for xseq, yseq in zip(crf_features, crf_labels):
    trainer.append(xseq, yseq)

trainer.set_params({
    'c1': 0.0,  # coefficient for L1 penalty
    'c2': 0.0,  # coefficient for L2 penalty
    'max_iterations': 10,  # stop earlier
    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})
trainer.train('testcrf.crfsuite')
tagger = pycrfsuite.Tagger()
tagger.open('testcrf.crfsuite')
tagger.tag(xseq)
# ['False', 'True', 'False']
It seems that I did manage to get it working, but I'm not sure if I've approached it correctly. I'll formulate my questions in the Questions section, but first, here is an alternative approach using the keras_contrib package:
from keras import Sequential
from keras_contrib.layers import CRF
from keras_contrib.losses import crf_loss
# we are gonna have to revisit data prep stage again
# separate predictors and response
response_df_dict = {}
for g in ['a', 'b']:
    response_df_dict[g] = df[[c for c in df.columns if 'next' in c and g in c]]
# reformat for LSTM
# the response for every row is a matrix with depth of 2 (the number of groups) and width = shift_length
# the predictors are of the same dimensions except the depth is not 2 but the number of predictors that we have
response_array_list = []
col_prefix = set([re.sub('_\d+$','',c) for c in df.columns if 'next' not in c])
for c in col_prefix:
    current_array = df[[z for z in df.columns if z.startswith(c)]].values
    response_array_list.append(current_array)
# reshape into samples (1), time stamps (2) and channels/variables (0)
response_array = np.array([response_df_dict['a'].values,response_df_dict['b'].values])
response_array = np.reshape(response_array, (response_array.shape[1], response_array.shape[2], response_array.shape[0]))
predictor_array = np.array(response_array_list)
predictor_array = np.reshape(predictor_array, (predictor_array.shape[1], predictor_array.shape[2], predictor_array.shape[0]))
model = Sequential()
model.add(CRF(2, input_shape=(predictor_array.shape[1],predictor_array.shape[2])))
model.summary()
model.compile(loss=crf_loss, optimizer='adam', metrics=['accuracy'])
model.fit(predictor_array, response_array, epochs=10, batch_size=1)
model_preds = model.predict(predictor_array) # not gonna worry about train/test split here
Questions:
My main question is whether I've constructed both of my CRF models correctly. What worries me is that (1) there is not a lot of documentation out there on CRF models, (2) CRFs are mainly used for predicting a single label given a sequence, (3) the input features are nested, and (4) I'm not sure whether using them in a multi-task fashion is valid.
I have a few extra questions as well:
Is a CRF appropriate for this problem?
How are the two approaches (one based on pycrfsuite and one based on keras_contrib) different, and what are their advantages/disadvantages?
In a more general sense, what is the advantage of combining CRF and LSTM models into one (like the one discussed here)?
Many thanks!

How to plot multi-index, categorical data?

Given the following data:
DC,Mode,Mod,Ven,TY1,TY2,TY3,TY4,TY5,TY6,TY7,TY8
Intra,S,Dir,C1,False,False,False,False,False,True,True,False
Intra,S,Co,C1,False,False,False,False,False,False,False,False
Intra,M,Dir,C1,False,False,False,False,False,False,True,False
Inter,S,Co,C1,False,False,False,False,False,False,False,False
Intra,S,Dir,C2,False,True,True,True,True,True,True,False
Intra,S,Co,C2,False,False,False,False,False,False,False,False
Intra,M,Dir,C2,False,False,False,False,False,False,False,False
Inter,S,Co,C2,False,False,False,False,False,False,False,False
Intra,S,Dir,C3,False,False,False,False,True,True,False,False
Intra,S,Co,C3,False,False,False,False,False,False,False,False
Intra,M,Dir,C3,False,False,False,False,False,False,False,False
Inter,S,Co,C3,False,False,False,False,False,False,False,False
Intra,S,Dir,C4,False,False,False,False,False,True,False,True
Intra,S,Co,C4,True,True,True,True,False,True,False,True
Intra,M,Dir,C4,False,False,False,False,False,True,False,True
Inter,S,Co,C4,True,True,True,False,False,True,False,True
Intra,S,Dir,C5,True,True,False,False,False,False,False,False
Intra,S,Co,C5,False,False,False,False,False,False,False,False
Intra,M,Dir,C5,True,True,False,False,False,False,False,False
Inter,S,Co,C5,False,False,False,False,False,False,False,False
Imports:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
To reproduce my DataFrame, copy the data then use:
df = pd.read_clipboard(sep=',')
I'd like to create a plot conveying the same information as my example, but not necessarily with the same shape (I'm open to suggestions). I'd also like to hover over a color and have the appropriate Ven displayed (e.g. C1, not 1).
Edit 2018-10-17:
The two solutions provided so far are helpful and each accomplishes a different aspect of what I'm looking for. However, the key issue I'd like to resolve, which wasn't explicitly stated prior to this edit, is the following:
I would like to perform the plotting without converting Ven to an int; this numeric transformation isn't practical with the real data. So the actual scope of the question is to plot all categorical data with two categorical axes.
The issue I'm experiencing is the data is categorical and the y-axis is multi-indexed.
I've done the following to transform the DataFrame:
# replace False with nan
df = df.replace(False, np.nan)

# replace True with a number representing Ven (e.g. C1 = 1)
def rep_ven(row):
    return row.iloc[4:].replace(True, int(row.Ven[1]))

df.iloc[:, 4:] = df.apply(rep_ven, axis=1)
# drop the Ven column
df = df.drop(columns=['Ven'])
# set multi-index
df_m = df.set_index(['DC', 'Mode', 'Mod'])
Plotting the transformed DataFrame produces:
plt.figure(figsize=(20,10))
heatmap = plt.imshow(df_m)
plt.xticks(range(len(df_m.columns.values)), df_m.columns.values)
plt.yticks(range(len(df_m.index)), df_m.index)
plt.show()
This plot isn't very streamlined; there are four axis values for each Ven. This is only a subset of the data, so the graph would be very long with all of it.
Here's my solution. Instead of plotting, I just apply a style to the DataFrame; see https://pandas.pydata.org/pandas-docs/stable/style.html
# Transform Ven values from "C1", "C2" to 1, 2, ..
df['Ven'] = df['Ven'].str[1]
# Given a specific combination of dc, mode, mod, ven,
# do we have any True cells?
g = df.groupby(['DC', 'Mode', 'Mod', 'Ven']).any()
# Let's drop any rows with only False values
g = g[g.any(axis=1)]
# Convert True, False to 1, 0
g = g.astype(int)
# Get the values of the ven index as an int array
# Note: we don't want to drop the ven index!!
# Otherwise styling won't work
ven = g.index.get_level_values('Ven').values.astype(int)
# Multiply 1 and 0 with Ven value
g = g.mul(ven, axis=0)
# Sort the index
g.sort_index(ascending=False, inplace=True)
# Now display the dataframe with styling
# first we get a color map
import matplotlib
cmap = matplotlib.cm.get_cmap('tab10')
def apply_color_map(val):
    # hide the 0 values
    if val == 0:
        return 'color: white; background-color: white'
    else:
        # for non-zero: get color from cmap, convert to hexcode for css
        s = "color:white; background-color: " + matplotlib.colors.rgb2hex(cmap(val))
        return s
g
g.style.applymap(apply_color_map)
The available matplotlib colormaps can be seen here: Colormap reference, with some additional explanation here: Choosing a colormap
Explanation: remove rows where TY1-TY8 are all NaN to create your plot. Refer to this answer as a starting point for creating interactive annotations to display Ven.
The below code should work:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_clipboard(sep=',')
# replace False with nan
df = df.replace(False, np.nan)

# replace True with a number representing Ven (e.g. C1 = 1)
def rep_ven(row):
    return row.iloc[4:].replace(True, int(row.Ven[1]))

df.iloc[:, 4:] = df.apply(rep_ven, axis=1)
# drop the Ven column
df = df.drop(columns=['Ven'])
idx = df[['TY1','TY2', 'TY3', 'TY4','TY5','TY6','TY7','TY8']].dropna(thresh=1).index.values
df = df.loc[idx,:].sort_values(by=['DC', 'Mode','Mod'], ascending=False)
# set multi-index
df_m = df.set_index(['DC', 'Mode', 'Mod'])
plt.figure(figsize=(20,10))
heatmap = plt.imshow(df_m)
plt.xticks(range(len(df_m.columns.values)), df_m.columns.values)
plt.yticks(range(len(df_m.index)), df_m.index)
plt.show()
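For the hover-over-Ven requirement, one hedged option is the mplcursors package on top of the imshow plot above. This is a sketch only: it assumes you keep a copy of the Ven column in row order before dropping it (the ven_labels name below is hypothetical) and that the default imshow pixel coordinates are in use:
import mplcursors

# Hypothetical: one 'C1'/'C2'/... label per row of df_m, saved before Ven was dropped above.
ven_labels = saved_ven.tolist()

cursor = mplcursors.cursor(heatmap, hover=True)

@cursor.connect("add")
def on_add(sel):
    row = int(round(sel.target[1]))  # y data coordinate maps to the row index for imshow
    sel.annotation.set_text(ven_labels[row])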

Fitting distribution functions to dataset in Python 3

I'm trying to find the probability distribution that best fits my data. I've tried the code I've found in different threads, but the results are not what I'm expecting.
The descriptive statistics and histogram for my data are as follows:
Data Histogram
count 865.000000
mean 43.476713
std 12.486362
min 4.075682
25% 34.934609
50% 41.917304
75% 51.271708
max 88.843940
I tried to find a proper distribution function using the following code, but the results are not what I expected.
import scipy
import scipy.stats as st
import matplotlib.pyplot as plt

size = 865
kappa = 99
x = scipy.arange(size)
y = scipy.int_(scipy.round_(st.vonmises.rvs(kappa, size=size) * 100))
h = plt.hist(df['spreadMaizChicagoAtlantico'], bins=100, color='b')

dist_names = ['gamma', 'beta', 'rayleigh', 'norm', 'pareto']
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(y)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
    plt.plot(pdf_fitted, label=dist_name)
    plt.xlim(0, 100)

plt.legend(loc='upper right')
plt.show()
Data histogram with functions
Can anyone please tell me what I'm doing wrong and guide me toward a better understanding of these solutions?
Thanks to an earlier reply, I found my mistake.
I got all the values from the DataFrame and made a numpy array.
ser=df.values
Then I ran code similar to the above, correcting the fit so that the distributions are fitted to the actual data:
size = 867
x = scipy.arange(size)
y = scipy.int_(scipy.round_(scipy.stats.vonmises.rvs(5, size=size) * 60))
h = plt.hist(ser, bins=range(80))

dist_names = ['beta', 'rayleigh', 'norm']
for dist_name in dist_names:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(ser)
    pdf_fitted = dist.pdf(x, *param[:-2], loc=param[-2], scale=param[-1]) * size
    plt.plot(pdf_fitted, label=dist_name)
    plt.xlim(0, 100)

plt.legend(loc='upper right')
plt.show()
The result is as follows, showing the histogram and three probability density functions.
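To go beyond comparing the curves by eye, one hedged way to rank the candidates is a Kolmogorov-Smirnov test per fitted distribution; a sketch reusing ser and the same fit call as above:
import scipy.stats

# A lower KS statistic (and a higher p-value) indicates a better fit.
for dist_name in ['beta', 'rayleigh', 'norm']:
    dist = getattr(scipy.stats, dist_name)
    param = dist.fit(ser)
    ks_stat, p_value = scipy.stats.kstest(ser, dist_name, args=param)
    print(dist_name, ks_stat, p_value)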
The distfit library can do this job as it searches for the best fit among 89 theoretical distributions.
pip install distfit
import numpy as np
from distfit import distfit
# Example data
X = np.random.normal(10, 3, 2000)
# Initialize
dfit = distfit()
# Search for best theoretical fit on your empirical data
dfit.fit_transform(X)
# The plot function will now also include the predictions of y
dfit.plot(chart='PDF',
          emp_properties={'linewidth': 4, 'color': 'k'},
          bar_properties={'edgecolor': 'k', 'color': 'g'},
          pdf_properties={'linewidth': 4, 'color': 'r'})
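Assuming the current distfit API, the fitted results can then be inspected rather than just plotted:
# Best-scoring distribution and its parameters (attribute names as documented by distfit).
print(dfit.model)
# Ranked table of all candidate distributions.
print(dfit.summary)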

DBSCAN sklearn memory issues

I am trying to use DBSCAN sklearn implementation for anomaly detection. It works fine for small datasets (500 x 6). However, it runs into memory issues when I try to use a large dataset (180000 x 24). Is there something I can do to overcome this issue?
from sklearn.cluster import DBSCAN
import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np
data = pd.read_csv("dataset.csv")
# Drop non-continuous variables
data.drop(["x1", "x2"], axis = 1, inplace = True)
df = data
data = df.values.astype("float32", copy=False)  # .as_matrix() was removed in newer pandas
stscaler = StandardScaler().fit(data)
data = stscaler.transform(data)
print "Dataset size:", df.shape
dbsc = DBSCAN(eps = 3, min_samples = 30).fit(data)
labels = dbsc.labels_
core_samples = np.zeros_like(labels, dtype = bool)
core_samples[dbsc.core_sample_indices_] = True
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
df['Labels'] = labels.tolist()
#print df.head(10)
print "Number of anomalies:", -1 * (df[df.Labels < 0]['Labels'].sum())
Depending on the type of problem you are tackling, you could play around with this parameter in the DBSCAN constructor:
leaf_size : int, optional (default = 30)
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
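As a sketch of that suggestion (the values are illustrative, not a recommendation), forcing a tree-based neighbour search and raising leaf_size can lower the memory needed to build and query the neighbour structure:
from sklearn.cluster import DBSCAN

dbsc = DBSCAN(eps=3, min_samples=30, algorithm='ball_tree', leaf_size=60).fit(data)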
If that does not suit your needs, this question has already been addressed here: you can try ELKI's DBSCAN implementation.
