Multilabel classification using catboost spark - apache-spark

While trying to implement multilabel classification with CatBoost for Apache Spark, I ran into the following error:
Py4JJavaError: An error occurred while calling o2950.fit.
: ai.catboost.CatBoostError: unsupported target column type: org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7
According to the documentation, I cannot set multiple columns in setLabelCol, so I assemble my target columns into a single vector with VectorAssembler and pass that column to the model. Where is my mistake?
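
For reference, a minimal sketch of the setup being described (the catboost_spark import, the DataFrame raw_df, and all column names are assumptions, not taken from the question):

from pyspark.ml.feature import VectorAssembler
import catboost_spark

# Assemble the individual target columns into a single vector column.
assembler = VectorAssembler(inputCols=["label_a", "label_b"], outputCol="labels")
train_df = assembler.transform(raw_df)

# Passing the assembled VectorUDT column as the label is what triggers
# "unsupported target column type ... VectorUDT".
classifier = (catboost_spark.CatBoostClassifier()
              .setFeaturesCol("features")
              .setLabelCol("labels"))
model = classifier.fit(train_df)  # raises ai.catboost.CatBoostError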

Related

Pyspark ML CrossValidator evaluate several evaluators

In sklearn's GridSearchCV we can evaluate the model with several scorers, and with the refit parameter we refit it on the whole dataset using the parameters that were best for one of them.
Is there any way to do something similar with CrossValidator from the ML package from pyspark?
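
For reference, a minimal sketch of the sklearn pattern being described (the dataset and parameter grid are placeholders):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(random_state=0)

# Several scorers; refit="f1" refits one final model on the whole
# dataset using the parameters that were best for the F1 score.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100]},
    scoring={"accuracy": "accuracy", "f1": "f1"},
    refit="f1",
)
search.fit(X, y)
print(search.best_params_)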

Why do I keep getting the error "An error was encountered: name 'tf' is not defined" when trying to load a model on one of the core instances in EMR?

I am trying to use %%spark on an EMR cluster to make predictions in parallel on multiple files using a pretrained Keras model. I can load the model on the master node and make predictions with it, but when I try to use Spark I get the error "name 'tf' is not defined" when trying to use or access the model.
Up to now I have tried the following solutions, and they all produce the same error:
I have tried passing the model in a lambda wrapper.
I have broadcast the model using sc.broadcast.
I downloaded model.h5 on all nodes and used load_model for each partition (see the sketch below). For load_model I used the custom_objects={'tf': tf} solution, and I also import tensorflow as tf in the wrapper function where the model is loaded.
None of the solutions worked. Has anyone had a similar experience? I am using r5 instances for the master and m5 instances for the nodes.
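
For illustration, a minimal sketch of the per-partition loading attempt described above (the model path, the DataFrame name df, and the feature handling are assumptions):

def predict_partition(rows):
    # Import inside the function so the modules exist in the executor
    # process, and pass tf via custom_objects so layers referencing
    # 'tf' can be deserialized.
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras.models import load_model

    model = load_model("/home/hadoop/model.h5", custom_objects={"tf": tf})  # path is an assumption
    for row in rows:
        features = np.array(row.features).reshape(1, -1)
        yield float(model.predict(features)[0][0])

predictions = df.rdd.mapPartitions(predict_partition).collect()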

How do you implement a model built using sklearn pipeline in pyspark?

I would like to use a model I built with an sklearn pipeline in PySpark. The pipeline takes care of imputation, scaling, one-hot encoding, and random forest classification. I tried broadcasting the model and using a pandas UDF to predict; it did not work, and I got a Py4JJavaError.
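
A rough sketch of the broadcast-plus-pandas-UDF approach being described (sk_pipeline, the DataFrame df, and the feature column names are assumptions; the type-hinted pandas_udf form requires Spark 3.x):

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# Ship the fitted sklearn pipeline to the executors once.
bc_model = spark.sparkContext.broadcast(sk_pipeline)

@pandas_udf(DoubleType())
def predict_udf(f1: pd.Series, f2: pd.Series, f3: pd.Series) -> pd.Series:
    X = pd.DataFrame({"f1": f1, "f2": f2, "f3": f3})
    return pd.Series(bc_model.value.predict(X)).astype("float64")

scored = df.withColumn("prediction", predict_udf("f1", "f2", "f3"))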

"Could not convert string to float" error while using the KNN algorithm

[screenshot of the dataset]
So I am trying to implement the KNN classification algorithm, but I am facing an error when trying to fit the model. Please help, I am a beginner.
[screenshot of the error]
[screenshot of the column data types]
The numeric columns that have object dtype are the ones causing the error, because when I fit the model without them it works. How do I convert them?
You can't have any non-numeric features in your dataset. You should encode all of your non-numeric features.
Scikit Learn Preprocessing
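
For illustration, a minimal sketch of both fixes (column names are assumptions): converting numeric values stored as strings with pd.to_numeric, and encoding truly categorical columns before fitting KNN.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Numeric values stored with object/string dtype: convert to real numbers.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Truly categorical columns: encode as integers (or one-hot encode).
df["city"] = LabelEncoder().fit_transform(df["city"].astype(str))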

PySpark MLlib: AssertionError: Classifier doesn't extend from HasRawPredictionCol

I am a newbie in Spark. I want to do multiclass classification with SVM in PySpark MLlib. I installed Spark 2.3.0 on Windows.
But I found that SVM is implemented for binary classification only in Spark, so we have to use the one-vs-all strategy. It gave me an error when I tried to use one-vs-all with SVM. I searched for the error but did not find a solution for it.
I used the code of one-vs-all from this link
https://spark.apache.org/docs/2.1.0/ml-classification-regression.html#one-vs-rest-classifier-aka-one-vs-all
Here is my code:
from pyspark.mllib.classification import SVMWithSGD , SVMModel
from pyspark.ml.classification import OneVsRest
# instantiate the One Vs Rest Classifier.
svm_model = SVMWithSGD()
ovr = OneVsRest(classifier=svm_model)
# train the multiclass model.
ovrModel = ovr.fit(rdd_train)
# score the model on test data.
predictions = ovrModel.transform(rdd_test)
The error is in the line "ovr.fit(rdd_train)". Here is the error:
File "D:/Mycode-newtrials - Copy/stance_detection -norelieff-lgbm - randomizedsearch - modified - spark.py", line 1460, in computescores
ovrModel = ovr.fit(rdd_train)
File "D:\python27\lib\site-packages\pyspark\ml\base.py", line 132, in fit
return self._fit(dataset)
File "D:\python27\lib\site-packages\pyspark\ml\classification.py", line 1758, in _fit
"Classifier %s doesn't extend from HasRawPredictionCol." % type(classifier)
AssertionError: Classifier <class 'pyspark.mllib.classification.SVMWithSGD'> doesn't extend from HasRawPredictionCol.
You get the error because you are trying to use a model from Spark ML (OneVsRest) with a base binary classifier from Spark MLlib (SVMWithSGD).
Spark MLlib (the old, RDD-based API) and Spark ML (the new, DataFrame-based API) are not only different libraries, they are also incompatible: you cannot mix models between them (looking closer at the examples, you'll see that they import the base classifier from pyspark.ml, not from pyspark.mllib as you are trying to do here).
Note that Spark ML does include a linear SVM (pyspark.ml.classification.LinearSVC, available since Spark 2.2), which extends HasRawPredictionCol and can therefore serve as the base classifier for OneVsRest.
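
A minimal sketch of that ML-API route (assuming DataFrames train_df and test_df with "label" and "features" columns; the hyperparameter values are arbitrary):

from pyspark.ml.classification import LinearSVC, OneVsRest

# The base binary classifier comes from Spark ML, not MLlib.
svm = LinearSVC(maxIter=100, regParam=0.01)
ovr = OneVsRest(classifier=svm)

# The ML API works on DataFrames, not RDDs.
ovr_model = ovr.fit(train_df)
predictions = ovr_model.transform(test_df)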
