PYSPARK: how to cluster more efficiently? [duplicate] - apache-spark

I am using Spark MLlib for k-means clustering. I have a set of vectors from which I want to determine the most likely cluster center, so I will run k-means training on this set and select the cluster with the highest number of vectors assigned to it.
Therefore I need to know the number of vectors assigned to each cluster after training (i.e. KMeans.run(...)), but I cannot find a way to retrieve this information from the KMeansModel result. I probably need to run predict on all training vectors and count the label that appears most often.
Is there another way to do this?
Thank you

You are right, this info is not provided by the model, and you have to run predict. Here is an example of doing so in a parallelized way (Spark v. 1.5.1):
from pyspark.mllib.clustering import KMeans
from numpy import array
data = array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0, 10.0, 9.0]).reshape(5, 2)
data
# array([[  0.,   0.],
#        [  1.,   1.],
#        [  9.,   8.],
#        [  8.,   9.],
#        [ 10.,   9.]])
k = 2  # no. of clusters
model = KMeans.train(
    sc.parallelize(data), k, maxIterations=10, runs=30, initializationMode="random",
    seed=50, initializationSteps=5, epsilon=1e-4)
cluster_ind = model.predict(sc.parallelize(data))
cluster_ind.collect()
# [1, 1, 0, 0, 0]
cluster_ind is an RDD with the same cardinality as our initial data, and it shows which cluster each datapoint belongs to. So, here we have two clusters: one with 3 datapoints (cluster 0) and one with 2 datapoints (cluster 1). Notice that we have run the prediction method in a parallel fashion (i.e. on an RDD); collect() is used here only for demonstration purposes, and it is not needed in a 'real' situation.
Now, we can get the cluster sizes with
cluster_sizes = cluster_ind.countByValue().items()
cluster_sizes
# [(0, 3), (1, 2)]
From this, we can get the maximum cluster index & size as
from operator import itemgetter
max(cluster_sizes, key=itemgetter(1))
# (0, 3)
i.e. our biggest cluster is cluster 0, with a size of 3 datapoints, which can be easily verified by inspecting the output of cluster_ind.collect() above.
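For completeness: in the newer DataFrame-based API (pyspark.ml, Spark 2.x), the training summary exposes the cluster sizes directly, so the extra predict pass is not needed. A minimal sketch, assuming a recent Spark version with a SparkSession named spark available:
from pyspark.ml.clustering import KMeans as MLKMeans
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),),
     (Vectors.dense([10.0, 9.0]),)],
    ["features"])
ml_model = MLKMeans(k=2, seed=50).fit(df)
ml_model.summary.clusterSizes  # e.g. [3, 2] -- number of points assigned to each cluster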

Related

How does pytorch L1-norm pruning work?

Let's see the result I got first. This is one of the convolution layers of my model, and I'm only showing 11 of its filters' weights (11 3x3 filters with channel = 1).
The left side is the original weight; the right side is the pruned weight.
So I was wondering how torch.nn.utils.prune.l1_unstructured works, because according to the PyTorch docs it prunes the units with the lowest L1 norm. But as far as I know, L1-norm pruning is a filter-pruning method that uses an equation over each filter's weights to find and prune the whole filters with the lowest values, instead of pruning single weights. So I'm a bit curious how this function actually works.
The following is my pruning code
parameters_to_prune = (
    (model.input_layer[0], 'weight'),
    (model.hidden_layer1[0], 'weight'),
    (model.hidden_layer2[0], 'weight'),
    (model.output_layer[0], 'weight'),
)
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=(pruned_percentage / 100),
)
The nn.utils.prune.l1_unstructured utility does not prune whole filters; it prunes individual parameter components, as you observed in your sheet. That is, the components with the lowest L1 norm (for single weights, simply the smallest absolute values) get masked to zero.
Here is a minimal example as discussed in the comments below:
>>> import torch
>>> import torch.nn as nn
>>> import torch.nn.utils.prune as prune
>>> m = nn.Linear(10, 1, bias=False)
>>> m.weight = nn.Parameter(torch.arange(10).float())
>>> prune.l1_unstructured(m, 'weight', amount=0.3)
>>> m.weight
tensor([0., 0., 0., 3., 4., 5., 6., 7., 8., 9.], grad_fn=<MulBackward0>)
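Under the hood, pruning reparameterizes the module: the original tensor is kept as weight_orig and a binary weight_mask is multiplied onto it on every forward pass, which is why the grad_fn above is a MulBackward0. If you want the filter-level behaviour described in the question (removing whole filters by their L1 norm), that is exposed separately as structured pruning. A minimal sketch, with the layer shape and amount chosen arbitrarily for illustration:
>>> m.weight_mask  # binary mask created by the l1_unstructured call above
tensor([0., 0., 0., 1., 1., 1., 1., 1., 1., 1.])
>>> conv = nn.Conv2d(1, 10, 3)  # 10 filters, each 1x3x3
>>> prune.ln_structured(conv, 'weight', amount=0.5, n=1, dim=0)  # prune whole filters by L1 norm along dim 0
>>> (conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum()  # half of the filters are now fully zeroed
tensor(5)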

What is meant by id's and labels in keras data generator?

https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
The link above is the documentation for a custom Keras data generator.
I have a doubt about the "Notation" heading in the above link, which says the following:
Before getting started, let's go through a few organizational tips that are particularly useful when dealing with large datasets.
Let ID be the Python string that identifies a given sample of the dataset. A good way to keep track of samples and their labels is to adopt the following framework:
1. Create a dictionary called partition where you gather:
a) in partition['train'] a list of training IDs
b) in partition['validation'] a list of validation IDs
2. Create a dictionary called labels where for each ID of the dataset, the associated label is given by labels[ID]
For example, let's say that our training set contains id-1, id-2 and id-3 with respective labels 0, 1 and 2, with a validation set containing id-4 with label 1. In that case, the Python variables partition and labels look like
>>> partition
{'train': ['id-1', 'id-2', 'id-3'], 'validation': ['id-4']}
and
>>> labels
{'id-1': 0, 'id-2': 1, 'id-3': 2, 'id-4': 1}
I'm really not able to understand what labels and IDs mean.
For example: say I have a data frame with 1000 columns, where each row corresponds to an ID, i.e., each ID is meant to be just a "data point".
Or
Say I have multiple data frames; does each data frame represent a different ID?
It seems labels is not meant to be the number of class variables.
I would like to have a clear understanding of IDs and labels, with some examples.
The mentioned article describes a good practice for organizing your data between training and validation: store the row indexes from your dataframe (named IDs here) and the corresponding target values (named labels here) in independent objects, so that in case of transformations on the input you don't lose track of things.
Here is a basic example using a train/test split:
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.DataFrame([[0.1, 1, 'label_a'], [0.2, 2, 'label_a'], [0.3, 3, 'label_a'], [0.4, 4, 'label_b']], columns=['feature_a', 'feature_b', 'target'])
# df.index.tolist() results in [0, 1, 2, 3] (4 rows)
partitions = dict()
labels = dict()
X_train, X_test, y_train, y_test = train_test_split(df[['feature_a', 'feature_b']], df['target'], test_size=0.25, random_state=42)
partitions['train'] = X_train.index.tolist()
partitions['validation'] = X_test.index.tolist()
# partitions['train'] results in [3, 0, 2]
# partitions['validation'] results in [1]
labels = df['target'].to_dict()
# labels is {0: 'label_a', 1: 'label_a', 2: 'label_a', 3: 'label_b'}
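To see how these two dictionaries are then consumed, here is a stripped-down sketch of the generator from the linked article, adapted to the toy dataframe above; it assumes the tf.keras API and that each ID maps to one row of df (in the article, each ID maps to a .npy file instead):
import numpy as np
from tensorflow import keras

class DataGenerator(keras.utils.Sequence):
    def __init__(self, list_IDs, labels, df, batch_size=2):
        self.list_IDs = list_IDs  # e.g. partitions['train']
        self.labels = labels      # the labels dict from above
        self.df = df
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        # select the IDs belonging to this batch
        batch_ids = self.list_IDs[index * self.batch_size:(index + 1) * self.batch_size]
        # load one sample per ID; the article uses np.load('data/' + ID + '.npy') here
        X = np.stack([self.df.loc[i, ['feature_a', 'feature_b']].values for i in batch_ids])
        y = np.array([self.labels[i] for i in batch_ids])
        return X, y

train_gen = DataGenerator(partitions['train'], labels, df)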

Should the same imputer coefficients be used for training and test datasets?

I am learning how to prepare data, build estimators, and check my results using a train/test data split.
My question is how I can prepare the test dataset correctly.
I split my data into a training and a test set, and as "Hands-On Machine Learning with Scikit-Learn" teaches me, I set up a pipeline for my data preparation:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])
After training my estimator, I want to use it on the test data to validate my accuracy. However, if I pass my test feature data through the pipeline I defined, isn't it calculating a new median value from only the test dataset, and new std_scaler statistics based on the test dataset, which will be different from the values arrived at on the training dataset?
I presume that for consistency I want to re-use the values obtained during training, since that is what the estimator has been fitted on. For example, if the test set were just a single row (or, in production, a single input I want to derive a prediction from), a median wouldn't even be computable if that single input has a NaN!
What step am I missing?
You must keep in mind what is happening. Imagine you have the following dataset (input features):
data = [[0, 1], [1, 0], [1, 0], [1, 1]]
scaler = StandardScaler()
scaler.fit(data)
print(scaler.mean_)
[0.75 0.5 ]
print(scaler.transform(data))
[[-1.73205081  1.        ]
 [ 0.57735027 -1.        ]
 [ 0.57735027 -1.        ]
 [ 0.57735027  1.        ]]
But now, if you fit the scaler on only part of the data and use it to transform different data (which is what you are doing in your approach):
data = [[0, 1], [1, 0]]
data2 = [[1,0], [1,1]]
scaler = StandardScaler()
scaler.fit(data)
print(scaler.mean_)
[0.5 0.5]
print(scaler.transform(data2))
[[ 1. -1.]
 [ 1.  1.]]
But, as the name "test data" suggests, keep it completely untouched until you evaluate your algorithm: fit the preprocessing on the training data only, and then apply the already-fitted transform to the test data.
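Concretely, with the pipeline from the question, that means calling fit_transform on the training set and only transform on the test set, so the imputer's medians and the scaler's means/stds all come from the training data. A minimal sketch, with toy arrays standing in for your features:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

X_train = np.array([[0., 1.], [1., 0.], [1., 0.], [1., 1.]])
X_test = np.array([[1., np.nan]])  # a single row with a NaN, as in the question

X_train_prepared = num_pipeline.fit_transform(X_train)  # fit learns medians, means and stds
X_test_prepared = num_pipeline.transform(X_test)        # re-uses the training statistics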
https://stats.stackexchange.com/questions/267012/difference-between-preprocessing-train-and-test-set-before-and-after-splitting

Using Kmeans to cluster small phrases in Spark

I have a list of words/phrases (around a million) that I would like to cluster. Assume it is the following list:
a_list = [u'java',u'javascript',u'python dev',u'pyspark',u'c ++']
a_list_rdd = sc.parallelize(a_list)
and I follow this procedure:
Using a string distance (let's say the Jaro-Winkler metric), I compute all the pairwise distances between the words in the list, which creates a 5x5 matrix with ones on the diagonal, since each word is also compared with itself. To compute all the distances I broadcast the whole list. So:
a_list_rdd_broadcasted = sc.broadcast(a_list_rdd.collect())
and the string distances computations:
import jaro
from numpy import array

def ComputeStringDistance(phrase, phrase_list_broadcasted):
    keyvalueDistances = []
    for value in phrase_list_broadcasted:
        distanceValue = jaro.jaro_winkler_metric(phrase, value)
        keyvalueDistances.append(distanceValue)
    return array(keyvalueDistances)

string_distances = (a_list_rdd
                    .map(lambda phrase: ComputeStringDistance(phrase, a_list_rdd_broadcasted.value)))
and using K means for clustering:
from pyspark.mllib.clustering import KMeans, KMeansModel
clusters = KMeans.train(string_distances, 3, maxIterations=10,
                        runs=10, initializationMode="random")
PredictGroup = string_distances.map(lambda point:clusters.predict(point)).zip(a_list_rdd)
and the results:
PredictGroup.collect()
Out[73]:
[(0, u'java'),
 (0, u'javascript'),
 (2, u'python dev'),
 (2, u'pyspark'),
 (1, u'c ++')]
Not bad! But what happens if I have 1 million observations and an estimated 10,000 clusters? Reading some posts, a large number of clusters is really expensive. Is there a way to get around this issue?
k-means does not operate on a distance matrix (distance matrices also do not scale).
K-means also does not work with arbitrary distance functions.
It is about minimizing variance, the sum of squared deviations from the mean.
What you are doing works because it is halfway to spectral clustering, but it is neither k-means used correctly nor spectral clustering.
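If the goal is simply a scalable way to run k-means over a million phrases, one common workaround (not the answerer's method, and it trades Jaro-Winkler similarity for character n-gram overlap) is to embed each phrase into a fixed-width feature vector and cluster those vectors, so no NxN distance matrix is ever materialized. A rough sketch, reusing a_list_rdd from above:
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.clustering import KMeans

def char_ngrams(phrase, n=3):
    # split a phrase into overlapping character trigrams
    return [phrase[i:i + n] for i in range(max(1, len(phrase) - n + 1))]

hashing_tf = HashingTF(numFeatures=4096)  # fixed-width hashed feature space
feature_vectors = a_list_rdd.map(lambda phrase: hashing_tf.transform(char_ngrams(phrase)))
feature_vectors.cache()  # KMeans.train makes several passes over the data

model = KMeans.train(feature_vectors, 3, maxIterations=10)
model.predict(feature_vectors).zip(a_list_rdd).collect()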
