DBSCAN sklearn memory issues - scikit-learn

I am trying to use DBSCAN sklearn implementation for anomaly detection. It works fine for small datasets (500 x 6). However, it runs into memory issues when I try to use a large dataset (180000 x 24). Is there something I can do to overcome this issue?
from sklearn.cluster import DBSCAN
import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np
data = pd.read_csv("dataset.csv")
# Drop non-continuous variables
data.drop(["x1", "x2"], axis = 1, inplace = True)
df = data
data = df.values.astype("float32", copy = False)  # as_matrix() was removed in pandas 1.0
stscaler = StandardScaler().fit(data)
data = stscaler.transform(data)
print("Dataset size:", df.shape)
dbsc = DBSCAN(eps = 3, min_samples = 30).fit(data)
labels = dbsc.labels_
core_samples = np.zeros_like(labels, dtype = bool)
core_samples[dbsc.core_sample_indices_] = True
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
df['Labels'] = labels.tolist()
# print(df.head(10))
print("Number of anomalies:", -1 * (df[df.Labels < 0]['Labels'].sum()))

Depending on the type of problem you are tackling, you could experiment with this parameter of the DBSCAN constructor:
leaf_size : int, optional (default = 30)
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
If that does not suit your needs, note that this question has already been addressed here; you can also try ELKI's DBSCAN implementation.
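As a rough illustration of the leaf_size suggestion above, here is a minimal sketch on synthetic data. The array shape, the choice of algorithm="ball_tree" and the value leaf_size=100 are assumptions made for the example, not settings taken from the question, and whether this alone resolves the memory problem depends on the data and on eps.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (assumed shape, float32 to save memory).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6)).astype("float32")
X = StandardScaler().fit_transform(X)

# Use a tree-based neighbour search and a larger leaf size than the default of 30.
model = DBSCAN(eps=3, min_samples=30, algorithm="ball_tree", leaf_size=100).fit(X)

labels = model.labels_
n_noise = int((labels == -1).sum())  # points labelled -1 are treated as anomalies
print("Number of anomalies:", n_noise)
```

A smaller eps usually reduces memory as well, because fewer neighbours are stored per point.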

Related

Receiving coordinates from inference Pytorch

I'm trying to get the coordinates of the pixels inside a mask generated by Detectron2's DefaultPredictor (PyTorch), so that I can later extract the polygon corners and use them in my application.
However, DefaultPredictor produces a tensor of pred_masks in the following format: [False, False ... False], ... [False, False, .. False]
The length of each inner list is the width of the image, and the number of lists is the height of the image.
Since I need the pixel coordinates that lie inside the mask, the simple solution seemed to be looping through pred_masks, checking each value and, if it is True, building a tuple and appending it to a list. However, with images of about 3200 x 1600 (width x height), this is relatively slow: about 4 seconds to loop through a single 3200x1600 mask, and since I need inference for quite a few objects, the total will be incredibly slow.
What would be a smarter way to get the coordinates (mask) of the detected object using the PyTorch (detectron2) model?
Please find my code below for reference:
from __future__ import print_function
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.data.datasets import register_coco_instances
import cv2
import time
# get image
start = time.time()
im = cv2.imread("inputImage.jpg")
# Create config
cfg = get_cfg()
cfg.merge_from_file("detectron2_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5 # Set threshold for this model
cfg.MODEL.WEIGHTS = "model_final.pth" # Set path model .pth
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1
cfg.MODEL.DEVICE='cpu'
register_coco_instances("dataset_test",{},"testval.json","Images_path")
test_metadata = MetadataCatalog.get("dataset_test")
# Create predictor
predictor = DefaultPredictor(cfg)
# Make prediction
outputs = predictor(im)
# Loop through pred_masks and add the coordinates of every True pixel to true_cords_list
outputnump = outputs["instances"].pred_masks.numpy()
true_cords_list = []
x_length = range(len(outputnump[0][0]))
# the y coordinate is the row index
for y_cord in range(len(outputnump[0])):
    # x coordinate
    for x_cord in x_length:
        if str(outputnump[0][y_cord][x_cord]) == "True":
            inputcoords = (x_cord,y_cord)
            true_cords_list.append(inputcoords)
print(str(true_cords_list))
end = time.time()
print(f"Runtime of the program is {end - start}") # 14.29468035697937
EDIT:
After changing the for loop partially to compress - I've managed to reduce the runtime of the for loop by ~3x - however, ideally I would like to receive this from the predictor itself if possible.
from itertools import compress
y_length = len(outputnump[0])
x_length = len(outputnump[0][0])
true_cords_list = []
for y_cord in range(y_length):
    # keep only the x positions whose mask value is True in this row
    x_cords = list(compress(range(x_length), outputnump[0][y_cord]))
    if x_cords:
        for x_cord in x_cords:
            inputcoords = (x_cord,y_cord)
            true_cords_list.append(inputcoords)
The problem is easily solved with NumPy's or PyTorch's native array handling, which can be around 100x faster than Python loops. It is worth studying the NumPy library; PyTorch tensors behave much like NumPy arrays.
How to get indices of values in NumPy:
import numpy as np
arr = np.random.rand(3,4) > 0.5
ind = np.argwhere(arr)[:, ::-1]
print(arr)
print(ind)
In your particular case this will be
ind = np.argwhere(outputnump[0])[:, ::-1]
How to get indices of values in PyTorch:
import torch
arr = torch.rand(3, 4) > 0.5
ind = arr.nonzero()
ind = torch.flip(ind, [1])
print(arr)
print(ind)
[:, ::-1] and torch.flip are used to reverse the order of the coordinates from (y, x) to (x, y).
NumPy and PyTorch also let you test simple conditions and get the indices of the values that meet them; for details, see the corresponding NumPy documentation.
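To make the index lookup and the condition-based variant concrete, here is a small NumPy-only sketch. The helper name mask_to_xy and the tiny arrays are illustrative assumptions; in the question, the mask would be outputnump[0].

```python
import numpy as np

def mask_to_xy(mask):
    """Return an (N, 2) array of (x, y) coordinates of the True pixels in a 2-D mask."""
    # np.argwhere yields (row, col) = (y, x); reversing the columns gives (x, y).
    return np.argwhere(mask)[:, ::-1]

# Hypothetical 3 x 4 mask with two True pixels.
mask = np.zeros((3, 4), dtype=bool)
mask[1, 2] = True
mask[2, 0] = True
coords = mask_to_xy(mask)

# The same call works on a condition, e.g. all values above a threshold.
values = np.array([[0.1, 0.9],
                   [0.7, 0.2]])
above = np.argwhere(values > 0.5)[:, ::-1]

print(coords.tolist())
print(above.tolist())
```

If a list of tuples is needed, as in the original loop, `list(map(tuple, coords))` converts the result.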
When asking, you should provide links that give context for your problem. This question is actually about Facebook's object detector (Detectron2), for which they provide a nice demo Colab notebook.

How to delete the outliers

I managed to apply the interquartile range rule, but when I display the box plot of the dataset without the outliers, I see that there are still outliers. What is wrong?
Here is the code:
# Load libraries
import pandas as pd;
from pandas import read_csv, set_option;
from matplotlib import pyplot as plt;
# Load dataset
filename = "/home/fogang/dataset/Regression/Housing Boston/housing.csv";
df = read_csv(filename, header=0);
df = df.drop('Unnamed: 0', axis=1); # Let's delete the column 'Unnamed: 0'
one_dim = pd.DataFrame();
one_dim['rm'] = df['rm'];
#shape dataset
print(one_dim.shape);
# Peek at dataset
print(one_dim.head(10));
# Let's look whether there are NaN values
print(one_dim.isnull().sum());
# Box and whisker plots
one_dim.plot(kind='box', subplots=True, layout=(1, 1), sharex=False, sharey=False, fontsize=12);
plt.show();
# Describe Dataset
print(one_dim.describe());
# Let's find Inter-Quartile Range
unidim = one_dim['rm'];
unidim_Q1 = unidim.quantile(0.25);
unidim_Q3 = unidim.quantile(0.75);
unidim_IQR = unidim_Q3 - unidim_Q1;
unidim_lower = unidim_Q1 - (1.5 * unidim_IQR);
unidim_upper = unidim_Q3 + (1.5 * unidim_IQR);
# Outliers
unidim_outliers = pd.DataFrame();
unidim_outliers['outliers'] = unidim[(unidim < unidim_lower) | (unidim > unidim_upper)]
unidim_outliers.info()
# Good data
unidim_good = pd.DataFrame();
unidim_good['good'] = unidim[(unidim >= unidim_lower) & (unidim <= unidim_upper)];
unidim_good.info();
unidim_good.plot(kind='box', subplots=True, layout=(1, 2), sharex=False, sharey=False, fontsize=12);
plt.show();
What should I do?
Your outliers are widely spread in both tails, upper and lower. So when you cut some of them out and check again, the quartiles and the IQR are recomputed on the trimmed data, the limits move inward, and new points fall outside them.
If you want to get rid of all the outliers with a single cut, you can use a stricter rule, for example:
unidim_lower = unidim_Q1 - (1.3 * unidim_IQR);
unidim_upper = unidim_Q3 + (1.3 * unidim_IQR);
But I should warn you: not all 'outliers' are bad for the model. You should choose wisely what to treat as an outlier and what is useful data anyway.
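An alternative to tightening the multiplier is to repeat the 1.5 x IQR cut until no new outliers appear. The sketch below shows that idea; the function names, the max_rounds guard and the sample values are assumptions for illustration and are not taken from the housing dataset.

```python
import pandas as pd

def iqr_bounds(s, k=1.5):
    """Lower and upper fences of a Series using the k * IQR rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def trim_until_stable(s, k=1.5, max_rounds=20):
    """Repeat the IQR cut until a round removes nothing (or max_rounds is reached)."""
    for _ in range(max_rounds):
        lower, upper = iqr_bounds(s, k)
        kept = s[(s >= lower) & (s <= upper)]
        if len(kept) == len(s):
            break
        s = kept
    return s

# Hypothetical sample with one obvious outlier.
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
cleaned = trim_until_stable(sample)
print(cleaned.tolist())
```

Repeated trimming can remove a lot of data on heavy-tailed distributions, so the warning above about useful 'outliers' applies here even more.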

How to bin a netcdf data using xarray

I have some spatiotemporal data derived from the CHIRPS database. It is a NetCDF file that contains daily precipitation for the whole world at a spatial resolution of 1 x 1 km. The Dataset has 3 dimensions ('time', 'longitude', 'latitude').
I would like to bin this precipitation data according to the temporal distribution at each pixel ('latitude' & 'longitude' coordinate). The dimension along which I want to bin is therefore 'time'.
A similar question has already been discussed on StackOverflow (see here). The difference is that, in my case, I need to bin the data according to each pixel's own temporal distribution, instead of applying a single set of bin edges to all coordinates (pixels). As a consequence, I expect to have different binning thresholds ('n' sets of thresholds), one for each of the 'n' pixels in my dataset.
As far as I understand, the simplest and fastest way to apply a function over each coordinate (pixel) of an xarray DataArray/Dataset is xarray.apply_ufunc.
For the binning, I am using the pandas qcut function, which only requires an array of values and a set of quantiles (e.g. [0.1%, 0.5%, 25%, 99%]).
Since qcut takes an array of data and returns another array of binned data, I understand that I have to pass vectorize=True to apply_ufunc (described here).
Finally, when I run the analysis, the resulting xarray Dataset loses the 'time' dimension. I am also unsure whether the processing truly returned a Dataset with the data properly classified.
Here is a reproducible code snippet. Notice that the 'time' dimension of ds_binned is lost, so I later have to insert the binned data back into the original xarray dataset (ds). Also notice that the dimensions are not in the proper order, which also causes problems for my analysis.
import pandas as pd
pd.set_option('display.width', 50000)
pd.set_option('display.max_rows', 50000)
pd.set_option('display.max_columns', 5000)
import numpy as np
import xarray as xr
from dask.diagnostics import ProgressBar
ds = xr.tutorial.open_dataset('rasm').load()
def parse_datetime(time):
    return pd.to_datetime([str(x) for x in time])
ds.coords['time'] = parse_datetime(ds.coords['time'].values)
def binning_function(x, distribution_type='Positive', b=False):
    y = np.where(np.abs(x)==np.inf, 0, x)
    y = np.where(np.isnan(y), 0, y)
    if np.all(y) == 0:
        return x
    else:
        Classified = pd.qcut(y, np.linspace(0.01, 1, 10))
        return Classified.codes
def xarray_parse_extremes(ds, dim=['time'], dask='allowed', new_dim_name=['classes'], kwargs={'b': False, 'distribution_type':'Positive'}):
    filtered = xr.apply_ufunc(binning_function,
                              ds,
                              dask=dask,
                              vectorize=True,
                              input_core_dims=[dim],
                              #exclude_dims = [dim],
                              output_core_dims=[new_dim_name],
                              kwargs=kwargs,
                              output_dtypes=[float],
                              join='outer',
                              dataset_fill_value=np.nan,
                              ).compute()
    return filtered
with ProgressBar():
    da_binned = xarray_parse_extremes(ds['Tair'] ,
                                      ['time'],
                                      dask='allowed')
da_binned.name = 'classes'
ds_binned = da_binned.to_dataset()
ds['classes'] = (('y', 'x', 'time'), ds_binned['classes'].values)
mask = (ds['classes'] >= 5) & (ds['classes'] != 0)
ds.where(mask, drop=True).resample({'time':'Y'}).count('time')['Tair'].isel({'time':-1}).plot()
print(ds)
(ds.where(mask, drop=True).resample({'time':'Y'}).count('time')['Tair']
.to_dataframe().dropna().sort_values('Tair', ascending=False)
)
delayed_to_netcdf = ds.to_netcdf(r'F:\Philipe\temp\teste_tutorial.nc',
engine='netcdf4',
compute =False)
print('saving data classified')
with ProgressBar():
    delayed_to_netcdf.compute()
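The thread gives no answer to this question, so the following is only a hedged sketch of one possible approach, not a confirmed fix. In apply_ufunc, the input core dimension is consumed and the output takes whatever names are listed in output_core_dims, so naming the output core dimension 'time' (instead of 'classes') should keep it. Core dimensions are placed last in the result, which would also explain the (y, x, time) order. The helper bin_series, the parameter n_bins and the synthetic array are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

def bin_series(x, n_bins=4):
    """Bin one pixel's time series into quantile classes 0 .. n_bins-1."""
    # labels=False returns integer bin codes; duplicates="drop" guards against repeated edges.
    return pd.qcut(x, n_bins, labels=False, duplicates="drop").astype(float)

# Synthetic stand-in for the real data: 2 x 3 pixels, 20 time steps.
rng = np.random.default_rng(0)
da = xr.DataArray(rng.random((2, 3, 20)), dims=("y", "x", "time"))

binned = xr.apply_ufunc(
    bin_series,
    da,
    input_core_dims=[["time"]],   # the function receives one 1-D series per pixel
    output_core_dims=[["time"]],  # and returns a series of the same length
    vectorize=True,
)

# Core dimensions come out last; transpose if another order is needed.
binned_time_first = binned.transpose("time", "y", "x")
print(binned.dims, binned_time_first.dims)
```

Each pixel is binned against its own quantiles here, which matches the per-pixel thresholds described in the question.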

Ranking all features in order using scikit-learn

I am trying to sort all features in order using scikit-learn f_regression and SelectKBest. The method works well if the number of ranked features k is smaller than the total number of features n. However, if I set k = n then the output from SelectKBest will be in the same order as the original feature array. How can I sort all features in order according to their importance?
The code is below:
from sklearn.feature_selection import SelectKBest, f_regression
n = len(training_features.columns)
selector = SelectKBest(f_regression, k = n)
selector.fit(training_features.values, training_targets.values[:, 0])
k_best_features = list(training_features.columns[selector.get_support(indices = True)])
I ended up using this solution:
import numpy as np
from sklearn.feature_selection import f_regression
k = 10 # number of best features to obtain
scores, _ = f_regression(training_features.values, training_targets.values[:, 0])
indices = np.argsort(scores)[::-1]
k_best_features = list(training_features.columns.values[indices[0:k]])
I think the features can be sorted by the scores given by f_regression using:
pd.DataFrame(dict(feature_names= training_features.columns , scores = selector.scores_))\
.sort_values('scores',ascending = False)
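For reference, here is a self-contained sketch of the same ranking idea on synthetic data. The column names, the coefficients and the sample size are made up for the example and stand in for training_features and training_targets.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Synthetic data: the target depends strongly on "b", weakly on "a", not on "c".
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y = 3.0 * X["b"] + 0.5 * X["a"] + rng.normal(scale=0.1, size=200)

# f_regression returns one F-score (and p-value) per feature.
scores, _ = f_regression(X.values, y.values)

# Rank all features from highest to lowest score.
ranking = pd.Series(scores, index=X.columns).sort_values(ascending=False)
print(ranking)
```

Calling f_regression directly avoids SelectKBest entirely, so the result does not depend on k.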

How do I fix KeyError bug in my code while implementing K-Nearest Neighbours from scratch?

I am trying to implement K-Nearest Neighbours algorithm from scratch in Python. The code I wrote worked well for the Breast-Cancer-Wisconsin.csv dataset.
However, when I run the same code on the Iris.csv dataset, my implementation fails with a KeyError.
The only difference between the 2 datasets is that Breast-Cancer-Wisconsin.csv has only 2 classes ('2' for malignant and '4' for benign), both labelled with integers, whereas Iris.csv has 3 classes ('setosa', 'versicolor', 'virginica'), all labelled with strings.
Here is the code I wrote (for Iris.csv) :
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
style.use('fivethirtyeight')
dataset = {'k':[[1,2],[2,3],[3,1]], 'r':[[6,5],[7,7],[8,6]]}
new_features = [5,7]
#[[plt.scatter(j[0],j[1], s=100, color=i) for j in dataset[i]] for i in dataset]
#plt.scatter(new_features[0], new_features[1], s=100)
#plt.show()
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
#full_data = df.astype(float).values.tolist()
#random.shuffle(full_data)
test_size = 0.2
train_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
test_set = {'setosa':[], 'versicolor':[], 'virginica':[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1
print('Accuracy : ', correct/total)
When I run the above code, I get a KeyError message at line number 49.
Could anyone please explain to me where I am going wrong? Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
Also, how do I handle if the classes are in string type instead of integer?
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
REFERENCES
Iris.csv
Breast-Cancer-Wisconsin.csv
Let's start from your last question:
one solution I thought of was to convert all string types to integer types and try to solve but would that work?
Yes, that would work. You shouldn't have to hardcode the names of all the classes of every problem in your code. Instead, you can just write a function that reads all the different values for the class attribute, and assigns a numeric value to each different one.
Could anyone please explain to me where I am going wrong?
Most likely, the problem is that you are reading an instance whose class attribute is not 'setosa', 'versicolor', 'virginica' (something like Iris-setosa perhaps?). The idea above should fix this problem.
Also, it would be great if someone could point out how do I modify this algorithm to classify multiple classes (instead of 2 or 3) in the future?
As discussed above, you just need to avoid hard-coding the names of the classes in your code.
Also, how do I handle if the classes are in string type instead of integer?
def get_class_values(data):
    # Map every distinct class label (last element of each row) to a code 0..N-1
    classes_seen = {}
    for i in data:
        _class = i[-1]
        if _class not in classes_seen:
            classes_seen[_class] = len(classes_seen)
    return classes_seen
A function like this one would return a mapping between all your classes (no matter the type) and numeric codes (from 0 to N-1). Using this mapping would also solve all the problems mentioned before.
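The advice about not hard-coding class names can also be followed without any numeric encoding, by letting the dictionary keys come from the data. This is a minimal sketch; the helper name group_by_label and the sample rows are assumptions for illustration, and it relies on the label being the last element of each row, as in the question's code.

```python
from collections import defaultdict

def group_by_label(rows):
    """Group feature vectors by their class label (last element of each row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[-1]].append(row[:-1])
    return dict(groups)

# Hypothetical rows in the layout used by the question: features first, label last.
rows = [
    [5.1, 3.5, "setosa"],
    [7.0, 3.2, "versicolor"],
    [4.9, 3.0, "setosa"],
]
train_set = group_by_label(rows)
print(train_set)
```

A dictionary built this way can be passed to k_nearest_neighbors as is, whatever the number or type of the class labels.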
Convert String Labels In CSV Files To Integer Labels
After going through some GitHub repos I came across a very simple yet elegant piece of code that solves the above problem. Hope it helps those who have faced this problem before (beginners especially!)
# read the csv file
df = pd.read_csv('iris.csv')
# clean the data file
df.replace('?', -99999, inplace=True)
# convert the string classes into integer types.
# integers are assigned from 0 to N-1.
# species is the name of the column which has class labels.
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], axis=1, inplace=True)
# convert the data frame to list
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
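If the original class names are needed again later, for example to report predictions, the mapping from codes back to names can be kept before the string column is dropped. This is a small sketch on a made-up four-row frame; the variable name code_to_name is an assumption.

```python
import pandas as pd

# Hypothetical miniature version of iris.csv.
df = pd.DataFrame({
    "sepal_length": [5.1, 6.3, 7.0, 4.9],
    "species": ["setosa", "virginica", "versicolor", "setosa"],
})

species = df["species"].astype("category")
df["species_value"] = species.cat.codes               # integer codes 0 .. N-1
code_to_name = dict(enumerate(species.cat.categories))  # code -> original label
df = df.drop(columns=["species"])

print(df)
print(code_to_name)
```

For string labels, pandas orders the categories alphabetically, so the codes do not depend on the row order of the file.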
Post Debugging
It turns out that the above piece of code is not needed either, i.e. I can get the answer without explicitly converting the string labels into integer labels (using the above code).
I have posted the original code after some minor changes (below) and the key error is now fixed. Also, I am now getting an accuracy of 97% to 100% (only on IRIS dataset).
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
That is the only change you need to make to the original code I posted in order to make it work!! Simple!
However, please note that the keys have to be given as integers and not as strings (otherwise you get a KeyError again!).
Wrap-Up
There are some commented lines in the original code which I thought would be good to explain in case somebody ran into some issues. Here's one snippet with the comments removed (compare with original code in the question).
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
Here's the output you get:
ValueError: could not convert string to float: 'virginica'
What went wrong?
Note that here we did not convert the string labels into integer labels. Therefore, when we tried to convert the data in the CSV to float values, Python raised an error because a string such as 'virginica' cannot be converted to float!
So one way to go about it is not to convert the data into floating point values, and then you won't get this error. However, in many cases you need all the data as floating point (e.g. for normalisation, accuracy, long mathematical calculations, or to prevent loss of precision).
Hence, after heavy debugging and going through a lot of articles, I finally came up with a simple version of the original code (below):
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from matplotlib import style
from collections import Counter
import warnings
import pandas as pd
import random
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result
df = pd.read_csv('iris.csv')
df.replace('?', -99999, inplace=True)
df['species'] = df['species'].astype('category')
df['species_value'] = df['species'].cat.codes
df.drop(['species'], axis=1, inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)
test_size = 0.2
train_set = {0:[], 1:[], 2:[]}
test_set = {0:[], 1:[], 2:[]}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]
for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])
correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1
print('Accuracy : ', (correct/total)*100,'%')
Hope this helps!
