Cook's distance in PySpark - apache-spark

I wanted to use Cook's distance to remove outliers from my dataset for regression, but I am not able to find any method to do so in PySpark. I know how to do it in Python using statsmodels' get_influence() method. Is there any similar method in PySpark?
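As far as I know there is no built-in Cook's distance in pyspark.ml, but since Cook's distance only needs the residuals, the residual variance, and the leverages h_ii = x_i' (X'X)^(-1) x_i, it can be computed by hand after fitting a LinearRegression. Below is a rough sketch, not a standard API: it assumes a SparkSession named spark and a DataFrame df whose feature columns x1, x2, x3 and label column label are placeholders for your own schema.

import numpy as np
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

feature_cols = ["x1", "x2", "x3"]                       # placeholder column names
data = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

model = LinearRegression(featuresCol="features", labelCol="label").fit(data)
pred = model.transform(data).withColumn("resid", F.col("label") - F.col("prediction"))

n = pred.count()
p = len(feature_cols) + 1                               # parameters incl. intercept
mse = pred.agg(F.sum(F.col("resid") * F.col("resid"))).first()[0] / (n - p)

# X'X over the design matrix (a column of ones prepended for the intercept)
xtx = (pred.select("features").rdd
           .map(lambda r: np.outer(np.insert(r["features"].toArray(), 0, 1.0),
                                   np.insert(r["features"].toArray(), 0, 1.0)))
           .reduce(lambda a, b: a + b))
xtx_inv = spark.sparkContext.broadcast(np.linalg.inv(xtx))

def cooks_d(row):
    x = np.insert(row["features"].toArray(), 0, 1.0)
    h = float(x @ xtx_inv.value @ x)                    # leverage h_ii
    return float(row["resid"] ** 2 / (p * mse) * h / (1.0 - h) ** 2)

cooks_distances = pred.rdd.map(cooks_d)                 # one Cook's distance per row

Rows whose distance exceeds your chosen cutoff (4/n is a common rule of thumb) can then be filtered out before refitting.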

Related

How to use functions from sklearn into pyspark

I have a training set with 201,917 rows, 3 features and 1 target. My aim is to measure the strength of the relationship between each individual feature and the target. My method of choice is sklearn.feature_selection.mutual_info_regression, because it works for continuous variables and can detect non-linear relationships better than its counterpart, sklearn.feature_selection.f_regression. This is the line I tried to run:
feature_selection.mutual_info_regression(trainPD[['feature_1']],trainPD['target'])
Now the problem is that if I run sklearn.feature_selection.mutual_info_regression in Colab, the system crashes. Hence my idea was to shift to PySpark. But pyspark.ml does not have an equivalent of sklearn.feature_selection.mutual_info_regression. So what are my options for using sklearn.feature_selection.mutual_info_regression with PySpark?
I am not sure if pandas_udf will help, because this is not the traditional pd.Series -> pd.Series conversion where PySpark's parallelization works.
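One pragmatic option, sketched below: with only 3 features and roughly 200k rows, the (feature, target) columns themselves are small, so Spark can be used just to select (and, if memory is still tight, downsample) those columns before handing a pandas DataFrame to sklearn on the driver. The DataFrame name sdf and the 50% sample fraction are assumptions; the column names follow the question.

from sklearn.feature_selection import mutual_info_regression

cols = ["feature_1", "feature_2", "feature_3"]
pdf = (sdf.select(cols + ["target"])
          .sample(fraction=0.5, seed=42)   # optional: drop this line to keep all rows
          .toPandas())

mi = mutual_info_regression(pdf[cols], pdf["target"])
print(dict(zip(cols, mi)))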

How to convert csv to RDD and use RDD in pyspark for some detection?

I'm currently working on research into heart disease detection and want to use Spark to process big data, as it is part of the solution in my work. But I'm having difficulty using Spark with Python because I cannot grasp how to use it. I can convert the csv file to an RDD, but then I don't understand how to work with the RDD to implement classification algorithms like kNN, logistic regression, etc.
So I would really appreciate it if anyone could help me in any way.
I have tried to learn PySpark from the internet, but there are very few code examples available, and those that exist are either too simple or too hard to understand. I cannot find any proper example of classification in PySpark.
To read the csv into a DataFrame you can just call spark.read.option('header', 'true').csv('path/to/csv').
The DataFrame will contain the columns and rows of your csv, and you can convert it into an RDD of rows with df.rdd.
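For the classification step it is usually easier to stay with DataFrames and pyspark.ml instead of hand-rolling algorithms on the raw RDD. A minimal sketch with logistic regression follows; the column names ("age", "chol", "target") are placeholders for whatever your csv actually contains, and spark is assumed to be an existing SparkSession.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

df = (spark.read.option("header", "true")
           .option("inferSchema", "true")          # parse numeric columns as numbers
           .csv("path/to/csv"))

assembler = VectorAssembler(inputCols=["age", "chol"], outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="target")
model = lr.fit(train)
model.transform(test).select("target", "prediction").show(5)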

How do we customize the centroids in k-means clustering

I am trying to implement k-means clustering on Spark using Python, and I want to specify the initial centroids instead of using 'random' or 'k-means++' initialization. I want to pass an RDD which contains the list of centroids. How should I do this in PySpark?
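If I remember correctly, the DataFrame-based pyspark.ml.clustering.KMeans does not expose a way to seed specific centroids, but the RDD-based pyspark.mllib API does: you can wrap your centroids in a KMeansModel and pass it as initialModel to KMeans.train. A sketch, where the centroid values and the RDD name data are placeholders:

from pyspark.mllib.clustering import KMeans, KMeansModel

initial_centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]]   # your own centers
init_model = KMeansModel(initial_centroids)

model = KMeans.train(
    data,                          # RDD of array-like feature vectors
    k=len(initial_centroids),
    maxIterations=20,
    initialModel=init_model,       # start from the supplied centroids
)
print(model.clusterCenters)

Note that the centers are supplied as a plain list rather than an RDD; if your centroids currently live in an RDD you can collect() them first, since there are only k of them.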

Python Pandas, One hot encoding on 3000000 rows freezes computer and runs slowly

I have a dataset with around three million rows in it, with a mix of categorical and numerical data, and I want to use scikit-learn's regression on the data.
Non-numeric data can't be put into a regression model, so I am looking for the best way to encode this data.
I believe that One-Hot Encoding is the best way to go and have come across two methods of doing this:
The first uses pandas' built-in function get_dummies:
setData = pd.get_dummies(s[['is_self', 'media', 'agg_accounts_active', 'weekday', 'hour', 'submission_type', 'advertiser_category', 'lang']])
The second uses scikit-learn's LabelEncoder and OneHotEncoder:
X_int = LabelEncoder().fit_transform(labels.ravel()).reshape(*labels.shape)
setData = OneHotEncoder().fit_transform(X_int).toarray()
I prefer the pandas method as it handles the numeric columns itself. However, I have run into efficiency problems: the get_dummies method freezes my computer when I try to use it on the three million rows.
Is there a better or more efficient way to do this?
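A common fix, sketched below, is to keep the encoded matrix sparse rather than dense: pd.get_dummies accepts sparse=True, and in recent scikit-learn versions OneHotEncoder accepts string columns directly and returns a scipy sparse matrix as long as you never call .toarray(). The DataFrame name s and the column list mirror the question.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cat_cols = ['is_self', 'media', 'agg_accounts_active', 'weekday', 'hour',
            'submission_type', 'advertiser_category', 'lang']

# Option 1: pandas, but with sparse dummy columns instead of dense ones
setData = pd.get_dummies(s[cat_cols], sparse=True)

# Option 2: scikit-learn's encoder, keeping the scipy sparse output
# (avoid .toarray(), which is what materializes the huge dense array)
enc = OneHotEncoder(handle_unknown='ignore')
X_sparse = enc.fit_transform(s[cat_cols])

Most scikit-learn estimators accept the sparse matrix directly, so there is rarely a reason to densify it.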

Customize Distance Formula of K-means in Apache Spark Python

Now I'm using k-means for clustering, following this tutorial and API.
But I want to use a custom formula to calculate distances. So how can I pass a custom distance function to k-means in PySpark?
In general, using a different distance measure doesn't make sense, because the k-means algorithm (unlike k-medoids) is well defined only for Euclidean distances.
See Why does k-means clustering algorithm use only Euclidean distance metric? for an explanation.
Moreover, MLlib algorithms are implemented in Scala, and PySpark provides only the wrappers required to execute that Scala code. Therefore, providing a custom metric as a Python function wouldn't be technically possible without significant changes to the API.
Please note that since Spark 2.4 there are two built-in measures that can be used with pyspark.ml.clustering.KMeans and pyspark.ml.clustering.BisectingKMeans (see the distanceMeasure param):
euclidean for Euclidean distance.
cosine for cosine distance.
Use at your own risk.
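For reference, a minimal example of selecting the built-in cosine measure (Spark 2.4+); the k value and column name are placeholders:

from pyspark.ml.clustering import KMeans

kmeans = KMeans(k=3, featuresCol="features", distanceMeasure="cosine")
model = kmeans.fit(data)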
