I'm trying to run some of the sample Python scikit-learn scripts on Spark on a single node (my desktop: a Mac with 8 GB of RAM). Here is my configuration in spark-env.sh:
SPARK_MASTER_HOST='IP'
SPARK_WORKER_INSTANCES=3
SPARK_WORKER_CORES=2
I'm starting my slaves with:
./sbin/start-slave.sh spark://IP
The Workers table at http://localhost:8080/ shows 3 workers running with 2 cores each.
My script file, which I took from https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-apache-spark.html:
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from spark_sklearn import GridSearchCV  # the distributed GridSearchCV used below
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
digits = datasets.load_digits()
X, y = digits.data, digits.target
param_grid = {"max_depth": [3, None],
"max_features": [1, 3, 10],
"min_samples_split": [2, 3, 10],
"min_samples_leaf": [1, 3, 10],
"bootstrap": [True, False],
"criterion": ["gini", "entropy"],
"n_estimators": [10, 20, 40, 80]}
gs = GridSearchCV(sc,RandomForestClassifier(), param_grid=param_grid)
gs.fit(X, y)
Submitting the script
spark-submit --master spark://IP traingrid.py
However, I do not see any significant improvement in execution time.
Is there any other configuration required to make it more parallel? Or should I add another node to improve it?
I'm trying to find the Jordan form of a given matrix using Colab, but it always fails or times out. I'm not sure why this is failing:
import numpy as np
import sys
from sympy import Matrix
sys.set_int_max_str_digits(15000)  # raise Python's limit on int-to-str conversion (Python 3.11+)
a = np.array([[1, 2, 4, 8], [1, 3, 9, 27], [1, 4, 16, 64], [1, 5, 25, 125]])
m = Matrix(a)
P, J = m.jordan_form()
J
I tried finding the Jordan form in MATLAB and with online calculators like
https://www.wolframalpha.com/input/?i=jordan+normal+form+calculator
and it works fine on those platforms. I'm not sure why Colab and Jupyter are unable to compute the Jordan form of this matrix.
Firstly, Colab and Jupyter are simply environments in which you can run Python code, and the issue here has nothing to do with using Colab, Jupyter, or any IDE.
Secondly, the reason you do not get results in your example is an algorithmic one. The matrix you are using is ill-conditioned: there are four orders of magnitude between its eigenvalues, and the underlying symbolic algorithm gets stuck while trying to compute the Jordan form.
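You can see the eigenvalue spread with a quick numerical check (a sketch using NumPy, separate from the symbolic computation in the question):
import numpy as np
a = np.array([[1, 2, 4, 8], [1, 3, 9, 27], [1, 4, 16, 64], [1, 5, 25, 125]])
# the numerical eigenvalues span several orders of magnitude, which is
# what makes the exact symbolic Jordan-form computation bog down
print(np.linalg.eigvals(a))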
If you try, as an example:
a = np.array([[5, 4, 2, 1], [0, 1, -1, -1], [-1, -1, 3, 0], [1, 1, -1, 2]])
you will see that your code works well and fast.
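For completeness, a minimal runnable version with this matrix (the same computation as in the question, only the matrix changed):
import numpy as np
from sympy import Matrix
a = np.array([[5, 4, 2, 1], [0, 1, -1, -1], [-1, -1, 3, 0], [1, 1, -1, 2]])
# jordan_form() returns P and J with a = P * J * P**-1; this matrix has
# small integer eigenvalues, so it finishes almost instantly
P, J = Matrix(a).jordan_form()
print(J)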
I am writing a simple library that, given a dataset, runs a bunch of analyses and shows a lot of plots along the way (there are many plt.show() calls).
I have written simple pytest tests to check that the different functions run without errors.
The problem is that once I run pytest, it starts showing all of these plots, and closing them one by one takes a lot of time.
How can I silence all the plots and just see whether the tests passed?
If your backend supports interactive display with plt.ion(), then you will need only minimal changes (four lines) to your code:
import matplotlib.pyplot as plt
# define a keyword for whether the interactive mode should be turned on
show_kw = True  # <--- added
#show_kw = False
if show_kw:  # <--- added
    plt.ion()  # <--- added
# your usual script
plt.plot([1, 2, 3], [4, 5, 6])
plt.show()
plt.plot([1, 3, 7], [4, 6, -1])
plt.show()
plt.plot([1, 3, 7], [4, 6, -1])
plt.show()
# closes all figure windows if show_kw is True;
# has no effect if no figure window is open
plt.close("all")  # <--- added
print("finished")
However, if the plot generation itself is time-consuming, this will not be enough: it only saves you from closing the windows one by one; the figures are still created. In that case, you can switch the backend to a non-GUI one that cannot display the figures:
import matplotlib.pyplot as plt
from matplotlib import get_backend
import warnings
show_kw = True
#show_kw = False
if show_kw:
    curr_backend = get_backend()
    # switch to a non-GUI backend, preventing plots from being displayed
    plt.switch_backend("Agg")
    # suppress the UserWarning that Agg cannot show plots
    warnings.filterwarnings("ignore", "Matplotlib is currently using agg")
plt.plot([1, 2, 3], [4, 5, 6])
plt.show()
plt.plot([1, 3, 7], [4, 6, -1])
plt.show()
plt.plot([1, 3, 7], [4, 6, -1])
plt.show()
# display some plots selectively
if show_kw:
    # restore the original backend
    plt.switch_backend(curr_backend)
plt.plot([1, 2, 3], [-2, 5, -1])
plt.show()
print("finished")
I have the following very simple code trying to model a simple dataset:
import pandas as pd  # needed for pd.DataFrame below
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
data = {'Feature_A': [1, 2, 3, 4], 'Feature_B': [7, 8, 9, 10], 'Feature_C': [2, 3, 4, 5], 'Label': [7, 7, 8, 9]}
data = pd.DataFrame(data)
data_labels = data['Label']
data = data.drop(columns=['Label'])
pipeline = Pipeline([('imputer', SimpleImputer()),
                     ('std_scaler', StandardScaler())])
data_prepared = pipeline.fit_transform(data)
lin_reg = LinearRegression()
lin_grid = {"n_jobs": [20, 50]}
error = "max_error"
grid_search = GridSearchCV(lin_reg, param_grid=lin_grid, verbose=3, cv=2, refit=True, scoring=error, return_train_score=True)
grid_search.fit(data_prepared, data_labels)
print(grid_search.best_estimator_.coef_)
print(grid_search.best_estimator_.intercept_)
print(list(data_labels))
print(list(grid_search.best_estimator_.predict(data_prepared)))
That gives me the following results:
[0.2608746 0.2608746 0.2608746]
7.75
[7, 7, 8, 9]
[6.7, 7.4, 8.1, 8.799999999999999]
From there, is there a way of computing the values of the features that would give me the maximum label, within the boundaries of the dataset?
If I understand your question correctly, this should work. Note that the model was fitted on the transformed data, so predict on data_prepared and then look up the corresponding row of the original data:
import numpy as np
# index of the row with the maximum predicted label
id_max = np.argmax(grid_search.predict(data_prepared))
# original (untransformed) feature values of that row
print(data.iloc[id_max])
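With the predictions shown in the question ([6.7, 7.4, 8.1, 8.8]), the argmax is index 3, so this prints the last row: Feature_A=4, Feature_B=10, Feature_C=5.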
I'm pretty new to clustering and am trying out HDBSCAN, but I'm having a hard time figuring out how to get the cluster centers. With KMeans they are stored on the fitted estimator as cluster_centers_.
How do I go about getting the cluster centers?
Here's my code:
#!/usr/bin/env python3
from sklearn.cluster import KMeans
from sklearn import metrics
import cv2
import numpy as np
import hdbscan
from pprint import pprint
# Read image into opencv
image = cv2.imread('4.jpg')
# Set color space
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# reshape the image to be a list of pixels
pixels = image.reshape((image.shape[0] * image.shape[1], 3))
# Build the clusterer
cluster = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
cluster.fit(pixels)
>>> pprint(vars(cluster))
{'_cluster_hierarchy_': <hdbscan.plots.SingleLinkageTree object at 0x110deda58>,
'_metric_kwargs': {},
'algorithm': 'best',
'alpha': 1.4142135623730951,
'core_dist_n_jobs': 4,
'cut': 0.125,
'gamma': 5,
'k': 7,
'labels_': array([ 0, 0, 0, ..., 360, 220, 172]),
'metric': 'euclidean'}
For comparison, this is what the same inspection of a fitted KMeans gives you:
{'cluster_centers': (array([ 64.93473757, 65.65262431, 72.00103591]),
array([ 77.55381605, 85.80626223, 102.29549902]),
array([ 105.66884532, 115.81917211, 131.55555556]),
array([ 189.20149254, 197.00497512, 205.43034826]),
array([ 148.0922619 , 156.5 , 168.33333333])),
'cluster_centers_': array([[ 105.66884532, 115.81917211, 131.55555556],
[ 64.93473757, 65.65262431, 72.00103591],
[ 148.0922619 , 156.5 , 168.33333333],
[ 189.20149254, 197.00497512, 205.43034826],
[ 77.55381605, 85.80626223, 102.29549902]]),
'copy_x': True,
'inertia_': 1023155.888923295,
'init': 'k-means++',
'labels_': array([1, 1, 1, ..., 1, 1, 1], dtype=int32),
'max_iter': 300,
'n_clusters': 5,
'n_init': 10,
'n_iter_': 8,
'n_jobs': 1,
'precompute_distances': 'auto',
'random_state': None,
'tol': 0.0001,
'verbose': 0}
Clusters in (H)DBSCAN do not have centers.
The clusters may be non-convex, and if you compute the average of all member points (and your data are points here, but they don't need to be), that average may lie outside the cluster.
Also note that DBSCAN produces noise points, which don't belong to any cluster and have no center at all.
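If you still want a representative point per cluster despite these caveats, a minimal sketch is to average each cluster's members yourself, using the labels_ of the fitted clusterer from the question (HDBSCAN labels noise as -1, which is skipped here):
import numpy as np
labels = np.asarray(cluster.labels_)
for label in np.unique(labels):
    if label == -1:  # noise has no meaningful center
        continue
    # mean of the pixels assigned to this cluster; as noted above,
    # it can fall outside a non-convex cluster
    center = pixels[labels == label].mean(axis=0)
    print(label, center)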
I want to implement k-means in Python, but I just don't know how to take the minimum of the Euclidean distances.
I have calculated the distances for 3 clusters; this is my resulting array:
[array([4, 5], dtype=int64), 4.1231056256176606, 0,
array([4, 8], dtype=int64), 4.4721359549995796, 0,
array([14, 23], dtype=int64), 22.022715545545239, 0,
array([4, 5], dtype=int64), 1.0, 1,
array([4, 8], dtype=int64), 2.0, 1,
array([14, 23], dtype=int64), 19.723082923316021, 1]
Here is my code:
for i in range(len(centroidrandom)):
    for j in range(3):
        # jarak_ = distance between sample j and centroid i
        jarak_ = euclidean_distances(data[j], centroidrandom[:][i])
        cluster.append(data[j])
        cluster.append(jarak_[0][0])
        cluster.append(i)
print(cluster)
Here is some example code for kmeans clustering with three clusters, modified from the example given in the comment above:
from pylab import plot, show
from numpy import vstack, array
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq
# data generation: three offset sets of 150 random 2-D points
data = vstack((rand(150, 2) + array([.5, .5]), rand(150, 2), rand(150, 2) + array([0, .5])))
# computing k-means with K = 3 (3 clusters)
centroids, _ = kmeans(data, 3)
# assign each sample to its nearest centroid
idx, _ = vq(data, centroids)
print(idx)
# some plotting using numpy's logical indexing
plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
     data[idx == 1, 0], data[idx == 1, 1], 'or',
     data[idx == 2, 0], data[idx == 2, 1], 'oy')
plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=8)
show()
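vq already performs the step the question asks about: it assigns each sample to the centroid with the minimum Euclidean distance. As a sketch of what that looks like by hand with NumPy (using data and centroids from the snippet above):
import numpy as np
# pairwise distances, shape (n_samples, n_centroids)
dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
# index of the nearest centroid for each sample -- the "min distance" step
idx_manual = np.argmin(dists, axis=1)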