Clustering data using DBSCAN and spark_sklearn - apache-spark

I want to cluster my input data using DBSCAN and spark_sklearn. I'd like to get the labels of each input instance after clustering. Is it possible?
Reading the documentation on http://pythonhosted.org/spark-sklearn, I tried the following:
temp_data = Spark DataFrame containing 'key' and 'features' columns,
where 'features' is a Vector.
ke = KeyedEstimator(sklearnEstimator=DBSCAN(), estimatorType="clusterer")
print ke.getOrDefault("estimatorType") --> "clusterer"
ke.fit_predict(temp_data) --> ERROR: 'KeyedEstimator' object has no attribute 'fit_predict'
k_model = ke.fit(temp_data)
print k_model.getOrDefault("estimatorType") --> "clusterer"
k_model.fit_predict(temp_data) --> ERROR: 'KeyedModel' object has no attribute 'fit_predict'
k_model.predict(temp_data) --> ERROR: 'KeyedModel' object has no attribute 'predict'
k_model.transform(temp_data) --> ERROR: estimatorType assumed to be a clusterer, but sklearnEstimator is missing fit_predict()
(NOTE: sklearn.cluster.DBSCAN actually has a fit_predict() method.)
What I normally do using sklearn (without Spark) is to fit the model on the features of temp_data (dbscan_model.fit(...)) and get the labels from it (labels = dbscan_model.labels_). It would also be fine if I could get the 'labels_' attribute using spark-sklearn.
If the above-mentioned calls ('transform' or 'predict') don't work, is it possible to get 'labels_' after fitting the data using spark-sklearn? How can I do that? Assuming we obtain 'labels_', how can I map the input instances to it? Do they have the same order?
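For reference, a minimal sketch of the plain-sklearn workflow described above (the small feature array here is made up for illustration):
import numpy as np
from sklearn.cluster import DBSCAN

features = np.array([[1.0, 2.0], [1.1, 2.1], [8.0, 9.0]])  # illustrative data
dbscan_model = DBSCAN(eps=0.5, min_samples=2).fit(features)
labels = dbscan_model.labels_  # one label per row of 'features', in the same order; -1 marks noise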

This is only possible for estimators such as KMeans, where we can predict cluster labels, because the underlying scikit-learn estimator provides this functionality.
Unfortunately, this is not the case for some other clusterers, such as DBSCAN.
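For example, with KMeans the same KeyedEstimator pattern should work end to end; this is a hedged sketch (the import path and the output column produced by transform may vary with the spark-sklearn version):
from sklearn.cluster import KMeans
from spark_sklearn.keyed_models import KeyedEstimator  # import path may differ by version

ke = KeyedEstimator(sklearnEstimator=KMeans(n_clusters=3), estimatorType="clusterer")
k_model = ke.fit(temp_data)                 # temp_data: DataFrame with 'key' and 'features' columns
predictions = k_model.transform(temp_data)  # KMeans provides fit_predict/predict, so transform should succeed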

I've managed to get the 'labels_' attribute; however, I still don't know whether the order of the resulting labels is the same as that of the input instances.
temp_data = Spark DataFrame containing 'key' and 'features' columns,
where 'features' is a Vector.
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

ke = KeyedEstimator(sklearnEstimator=DBSCAN())
k_model = ke.fit(temp_data)

def getLabels(model):
    return model.estimator.labels_

labels_udf = udf(lambda x: getLabels(x).tolist(), ArrayType(IntegerType()))("estimator").alias("labels")
res_df = k_model.keyedModels.select("key", labels_udf)
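To map the labels back to the input instances, here is a sketch, under the assumption (worth verifying) that labels_ follows the order in which each key's instances were passed to fit:
from pyspark.sql.functions import posexplode

# 'pos' is the 0-based position of the instance within its key's group
exploded = res_df.select("key", posexplode("labels").alias("pos", "label"))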

Related

Converting an object Pandas dataframe (containing strings and integers) to a scipy sparse matrix

I have a dataframe with two columns: one column, the medicine name, is of dtype object and contains medicine names, some of which are followed by their mg (e.g. Avil25 in one row and Avil50 in another); the other column, Price, is of dtype int. I'm trying to convert the medicine-name column into a scipy csr_matrix using the following lines of code:
from scipy.sparse import csr_matrix
sparse_matrix = csr_matrix(medName)
I am getting the following error message:
TypeError: no supported conversion for types: (dtype('O'),)
As an alternative, I tried removing the integers from the column using medName.str.replace('\d+', '') and then tried sparse_matrix = csr_matrix(medName.astype(str)). I am still getting the same error.
What's going wrong here?
What is another way to convert this dataframe to csr matrix?
You will have to encode the strings as numeric data types for them to be made sparse. One solution (probably not the most memory efficient) is to build a networkx graph where the string words are the nodes; using the graph's nodelist you can keep track of the word-to-numeric mapping.
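A minimal sketch of one possible encoding (using pandas.factorize instead of networkx; the tiny dataframe is made up for illustration):
import pandas as pd
from scipy.sparse import csr_matrix

df = pd.DataFrame({"medName": ["Avil25", "Avil50", "Avil25"], "Price": [10, 18, 10]})
codes, uniques = pd.factorize(df["medName"])      # codes: numeric id per row; uniques: id -> name mapping
sparse_matrix = csr_matrix(codes.reshape(-1, 1))  # dtype is now numeric, so the conversion works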

How to count all values in one key of a pyspark RDD?

In a pyspark RDD, 'predicted_values' is the key for the results of a logistic regression. Obviously, 'predicted_values' holds only 0 and 1.
I want to count the number of 0's and 1's in the output field.
I try:
Counter(rdd.groupByKey()['predicted_value'])
which gives
TypeError: 'PipelinedRDD' object is not subscriptable
What is the best way to do this?
You could also use countByValue():
sorted(rdd.map(lambda x: x['predicted_value']).countByValue().items())
#[(0, 580), (1, 420)]
It appears that this can also be done with the Counter class from collections:
>>> Counter([i['predicted_value'] for i in rdd.collect()])
Counter({0: 580, 1: 420})
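Another option, as a sketch, is to keep the counting on the executors instead of collecting the whole RDD to the driver:
from operator import add

counts = rdd.map(lambda x: (x['predicted_value'], 1)).reduceByKey(add).collect()
# e.g. [(0, 580), (1, 420)]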

Reconstructing k-means using pre-computed cluster centres

I'm using k-means for clustering with 60 clusters. Since some of the clusters come out as meaningless, I've deleted those cluster centers (count = 8) from the cluster-center array and saved the remaining ones in clean_cluster_centers.
This time, I'm re-fitting the k-means model with init=clean_cluster_centers, n_clusters=52 and max_iter=1, because I want to avoid re-fitting as much as possible.
The basic idea is to recreate a new model from clean_cluster_centers. The problem is that, since we are removing a large number of clusters, the model quickly moves to new, more stable centers even with max_iter=1. Is there any way to recreate the k-means model?
If you've fitted a KMeans object, it has a cluster_centers_ attribute. You can directly update it by doing something like this:
cls.cluster_centers_ = new_cluster_centers
So if you want a new object with the clean cluster centers, just do something like the following:
import copy

cls = KMeans().fit(X)
cls2 = copy.deepcopy(cls)  # scikit-learn estimators have no .copy() method
cls2.cluster_centers_ = new_cluster_centers
And now, since predict only checks that your object has a fitted cluster_centers_ attribute, you can call the predict function; for reference, its implementation is:
def predict(self, X):
    """Predict the closest cluster each sample in X belongs to.

    In the vector quantization literature, `cluster_centers_` is called
    the code book and each value returned by `predict` is the index of
    the closest code in the code book.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        New data to predict.

    Returns
    -------
    labels : array, shape [n_samples,]
        Index of the cluster each sample belongs to.
    """
    check_is_fitted(self, 'cluster_centers_')

    X = self._check_test_data(X)
    x_squared_norms = row_norms(X, squared=True)
    return _labels_inertia(X, x_squared_norms, self.cluster_centers_)[0]
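Putting it together, a sketch of the whole approach (the indices of the unwanted clusters are hypothetical, and newer scikit-learn versions may rely on additional fitted attributes in predict, so this is worth verifying against your version):
import copy
import numpy as np
from sklearn.cluster import KMeans

cls = KMeans(n_clusters=60).fit(X)          # X: the original data
bad_idx = [3, 7, 15]                        # hypothetical indices of the meaningless clusters
new_cluster_centers = np.delete(cls.cluster_centers_, bad_idx, axis=0)

cls2 = copy.deepcopy(cls)
cls2.cluster_centers_ = new_cluster_centers
labels = cls2.predict(X)                    # labels index into the remaining centers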

Stratified Sampling in python scikit-learn

I want to divide my dataset into train and test sets using stratified sampling (scikit-learn). My approach is as follows:
1) I'm reading a CSV file and loading it using pandas read_csv, so ultimately I'm storing the loaded CSV in a dataframe named "dataset":
dataset = pd.read_csv('CSV_NAME)
2) Now I'm applying stratified sampling as:
train,test = train_test_split(dataset,test_size=0.20,stratify=True)
But it throws the following error:
TypeError: Singleton array array(True, dtype=bool) cannot be considered a valid collection.
So please suggest the correct way to do it.
'train_test_split' needs to know what the target variable is, and stratify should be given that variable rather than True. Therefore, you should change your call to something like:
train, test = train_test_split(dataset[needed columns], dataset.target, test_size=0.20, stratify=dataset.target)
Btw, there is a missing single quote in your first line of code.
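For reference, a minimal sketch of a stratified split along these lines, assuming the dataframe has a 'target' column holding the class labels (the column name is a placeholder for your own):
from sklearn.model_selection import train_test_split

X = dataset.drop(columns=["target"])
y = dataset["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, stratify=y)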
You could convert the pandas dataframe to a numpy array as follows:
import numpy
dataset = pd.read_csv('CSV_NAME')
dataset = numpy.array(dataset)
as suggested in the second answer here: https://www.quora.com/How-does-python-pandas-go-along-with-scikit-learn-library-Has-anyone-doing-data-analysis-using-pandas-and-then-then-fit-models-using-scikit-learn
Or you could read the dataset into a numpy array directly.

Filtering Spark DataFrame on new column

Context: I have a dataset that is too large to fit in memory, on which I am training a Keras RNN. I am using PySpark on an AWS EMR cluster to train the model in batches that are small enough to be stored in memory. I was not able to implement the model as distributed using elephas, and I suspect this is related to my model being stateful. I'm not entirely sure though.
The dataframe has a row for every user and every day elapsed since the day of install, from 0 to 29. After querying the database I do a number of operations on the dataframe:
query = """WITH max_days_elapsed AS (
SELECT user_id,
max(days_elapsed) as max_de
FROM table
GROUP BY user_id
)
SELECT table.*
FROM table
LEFT OUTER JOIN max_days_elapsed USING (user_id)
WHERE max_de = 1
AND days_elapsed < 1"""
df = read_from_db(query) #this is just a custom function to query our database
#Create features vector column
assembler = VectorAssembler(inputCols=features_list, outputCol="features")
df_vectorized = assembler.transform(df)
#Split users into train and test and assign batch number
udf_randint = udf(lambda x: np.random.randint(0, x), IntegerType())
training_users, testing_users = df_vectorized.select("user_id").distinct().randomSplit([0.8,0.2],123)
training_users = training_users.withColumn("batch_number", udf_randint(lit(N_BATCHES)))
#Create and sort train and test dataframes
train = df_vectorized.join(training_users, ["user_id"], "inner").select(["user_id", "days_elapsed","batch_number","features", "kpi1", "kpi2", "kpi3"])
train = train.sort(["user_id", "days_elapsed"])
test = df_vectorized.join(testing_users, ["user_id"], "inner").select(["user_id","days_elapsed","features", "kpi1", "kpi2", "kpi3"])
test = test.sort(["user_id", "days_elapsed"])
The problem I am having is that I cannot filter on batch_number without caching train first. I can filter on any of the columns that are in the original dataset in our database, but not on any column I have generated in PySpark after querying the database:
This: train.filter(train["days_elapsed"] == 0).select("days_elapsed").distinct().show() returns only 0.
But, all of these return all of the batch numbers between 0 and 9 without any filtering:
train.filter(train["batch_number"] == 0).select("batch_number").distinct().show()
train.filter(train.batch_number == 0).select("batch_number").distinct().show()
train.filter("batch_number = 0").select("batch_number").distinct().show()
train.filter(col("batch_number") == 0).select("batch_number").distinct().show()
This also does not work:
train.createOrReplaceTempView("train_table")
batch_df = spark.sql("SELECT * FROM train_table WHERE batch_number = 1")
batch_df.select("batch_number").distinct().show()
All of these work if I do train.cache() first. Is that absolutely necessary or is there a way to do this without caching?
Spark >= 2.3 (depending on the progress of SPARK-22629)
It should be possible to disable certain optimizations using the asNondeterministic method.
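A sketch of what that would look like for the UDF from the question (Spark 2.3+; the int() cast is there because PySpark may not convert numpy integers automatically):
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

udf_randint = udf(lambda x: int(np.random.randint(0, x)), IntegerType()).asNondeterministic()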
Spark < 2.3
Don't use a UDF to generate random numbers. First of all, to quote the docs:
The user-defined functions must be deterministic. Due to optimization, duplicate invocations may be eliminated or the function may even be invoked more times than it is present in the query.
Even setting the UDF aside, there are Spark subtleties that make it almost impossible to implement this correctly when processing single records.
Spark already provides rand:
Generates a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].
and randn
Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
which can be used to build more complex generator functions.
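For instance, the batch-number assignment from the question could be done with rand alone, roughly like this (N_BATCHES as in the question; the seed is optional):
from pyspark.sql.functions import rand, floor

training_users = training_users.withColumn(
    "batch_number", floor(rand(seed=123) * N_BATCHES).cast("int"))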
Note:
There can be some other issues with your code, but this one makes it unacceptable from the start (see: Random numbers generation in PySpark; pyspark - Transformer that generates a random number generates always the same number).
