Export and Import models/pipeline as JSON using PySpark - apache-spark

I have a fitted pipeline in pyspark. I want to convert the fitted pipeline into JSON and store it in the database so another person can fetch it from the database and convert it again into a fitted pipeline spark object.
I am using jupyter-notebook to run the spark
from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
data = (spark.read.option("delimiter", ",")
.option("inferSchema", "true")
.option("header", "true")
.csv("./data/data.csv")).drop("id")
reduced_numeric_cols = ["account_length", "number_vmail_messages", "total_day_calls",
"total_day_charge", "total_eve_calls", "total_eve_charge",
"total_sales_calls", "total_sales_calls", "total_sales_charge"]
label_indexer = StringIndexer(inputCol = 'quantity', outputCol = 'label')
plan_indexer = StringIndexer(inputCol = 'sales_plan', outputCol = 'sales_plan_indexed')
assembler = VectorAssembler(
inputCols = ['sales_plan_indexed'] + reduced_numeric_cols,
outputCol = 'features')
classifier = GBTClassifier(labelCol = 'label', featuresCol = 'features')
pipeline = Pipeline(stages=[plan_indexer, label_indexer, assembler, classifier])
(train, test) = data.randomSplit([0.7, 0.3]).
model = pipeline.fit(train)
// I want to save the model as a JSON object in file
// Read the JSON file and convert it into again the fitted model

Related

Multiple Evaluators in CrossValidator - Spark ML

Is it possible to have more than 1 evaluator in a CrossValidator to get R2 and RMSE at the same time?
Instead of having two different CrossValidator:
val lr_evaluator_rmse = new RegressionEvaluator()
.setLabelCol("ArrDelay")
.setPredictionCol("predictionLR")
.setMetricName("rmse")
val lr_evaluator_r2 = new RegressionEvaluator()
.setLabelCol("ArrDelay")
.setPredictionCol("predictionLR")
.setMetricName("r2")
val lr_cv_rmse = new CrossValidator()
.setEstimator(lr_pipeline)
.setEvaluator(lr_evaluator_rmse)
.setEstimatorParamMaps(lr_paramGrid)
.setNumFolds(3)
.setParallelism(3)
val lr_cv_r2 = new CrossValidator()
.setEstimator(lr_pipeline)
.setEvaluator(lr_evaluator_rmse)
.setEstimatorParamMaps(lr_paramGrid)
.setNumFolds(3)
.setParallelism(3)
Something like this:
val lr_cv = new CrossValidator()
.setEstimator(lr_pipeline)
.setEvaluator(lr_evaluator_rmse)
.setEvaluator(lr_evaluator_r2)
.setEstimatorParamMaps(lr_paramGrid)
.setNumFolds(3)
.setParallelism(3)
Thanks in advance
The PySpark documentation on CrossValidator indicates that the evaluator argument is a single entity --> evaluator: Optional[pyspark.ml.evaluation.Evaluator] = None
The solution I went with was to create separate pipelines for each evaluator. For example,
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator
# Convert inputs to vector assembler
vec_assembler = VectorAssembler(inputCols=[inputs], outputCol="features")
# Create Random Forest Classifier pipeline
rf = RandomForestClassifier(labelCol="label", seed=42)
multiclass_evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="label", metricName="accuracy")
binary_evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label")
# Plop model objects into cross validator
cv1 = CrossValidator(estimator=rf, evaluator=multiclass_evaluator, numFolds=3, parallelism=4, seed=42)
cv2 = CrossValidator(estimator=rf, evaluator=binary_evaluator, numFolds=3, parallelism=4, seed=42)
# Put all step in a pipeline
pipeline1 = Pipeline(stages=[vec_assembler, cv1])
pipeline2 = Pipeline(stages=[vec_assembler, cv2])

Confusion Matrix to get precsion,recall, f1score

I have a dataframe df. I have performed decisionTree classification algorithm on the dataframe. The two columns are label and features when algorithm is performed. The model is called dtc. How can I create a confusion matrix in pyspark?
dtc = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label')
dtcModel = dtc.fit(train)
predictions = dtcModel.transform(test)
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.evaluation import MulticlassMetrics
preds = df.select(['label', 'features']) \
.df.map(lambda line: (line[1], line[0]))
metrics = MulticlassMetrics(preds)
# Confusion Matrix
print(metrics.confusionMatrix().toArray())```
You need to cast to an rdd and map to tuple before calling metrics.confusionMatrix().toArray().
From the official documentation,
class pyspark.mllib.evaluation.MulticlassMetrics(predictionAndLabels)[source]
Evaluator for multiclass classification.
Parameters: predictionAndLabels – an RDD of (prediction, label) pairs.
Here is an example to guide you.
ML part
import pyspark.sql.functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.mllib.evaluation import MulticlassMetrics
from pyspark.sql.types import FloatType
#Note the differences between ml and mllib, they are two different libraries.
#create a sample data frame
data = [(1.54,3.45,2.56,0),(9.39,8.31,1.34,0),(1.25,3.31,9.87,1),(9.35,5.67,2.49,2),\
(1.23,4.67,8.91,1),(3.56,9.08,7.45,2),(6.43,2.23,1.19,1),(7.89,5.32,9.08,2)]
cols = ('a','b','c','d')
df = spark.createDataFrame(data, cols)
assembler = VectorAssembler(inputCols=['a','b','c'], outputCol='features')
df_features = assembler.transform(df)
#df.show()
train_data, test_data = df_features.randomSplit([0.6,0.4])
dtc = DecisionTreeClassifier(featuresCol='features',labelCol='d')
dtcModel = dtc.fit(train_data)
predictions = dtcModel.transform(test_data)
Evaluation part
#important: need to cast to float type, and order by prediction, else it won't work
preds_and_labels = predictions.select(['predictions','d']).withColumn('label', F.col('d').cast(FloatType())).orderBy('prediction')
#select only prediction and label columns
preds_and_labels = preds_and_labels.select(['prediction','label'])
metrics = MulticlassMetrics(preds_and_labels.rdd.map(tuple))
print(metrics.confusionMatrix().toArray())
Use this:
import sklearn
from pyspark.ml.classification import RandomForestClassifier
rf = RandomForestClassifier(featuresCol = 'features', labelCol = 'label', numTrees=500)
rfModel = rf.fit(train)
predictions_train = rfModel.transform(train)
y_true = predictions_train.select(['label']).collect()
y_pred = predictions_train.select(['prediction']).collect()
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_true, y_pred))
where train is your training data.

Spark 2.1.1: How to predict topics in unseen documents on already trained LDA model in Spark 2.1.1?

I am training an LDA model in pyspark (spark 2.1.1) on a customers review dataset. Now based on that model I want to predict the topics in the new unseen text.
I am using the following code to make the model
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext, Row
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, StopWordsRemover
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.ml.clustering import DistributedLDAModel, LocalLDAModel
from pyspark.mllib.linalg import Vector, Vectors
from pyspark.sql.functions import *
import pyspark.sql.functions as F
path = "D:/sparkdata/sample_text_LDA.txt"
sc = SparkContext("local[*]", "review")
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv("D:/sparkdata/customers_data.csv", header=True, inferSchema=True)
data = df.select("Reviews").rdd.map(list).map(lambda x: x[0]).zipWithIndex().map(lambda words: Row(idd= words[1], words = words[0].split(" "))).collect()
docDF = spark.createDataFrame(data)
remover = StopWordsRemover(inputCol="words",
outputCol="stopWordsRemoved")
stopWordsRemoved_df = remover.transform(docDF).cache()
Vector = CountVectorizer(inputCol="stopWordsRemoved", outputCol="vectors")
model = Vector.fit(stopWordsRemoved_df)
result = model.transform(stopWordsRemoved_df)
corpus = result.select("idd", "vectors").rdd.map(lambda x: [x[0],Vectors.fromML(x[1])]).cache()
# Cluster the documents topics using LDA
ldaModel = LDA.train(corpus, k=3,maxIterations=100,optimizer='online')
topics = ldaModel.topicsMatrix()
vocabArray = model.vocabulary
print(ldaModel.describeTopics())
wordNumbers = 10 # number of words per topic
topicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic = wordNumbers))
def topic_render(topic): # specify vector id of words to actual words
terms = topic[0]
result = []
for i in range(wordNumbers):
term = vocabArray[terms[i]]
result.append(term)
return result
topics_final = topicIndices.map(lambda topic: topic_render(topic)).collect()
for topic in range(len(topics_final)):
print("Topic" + str(topic) + ":")
for term in topics_final[topic]:
print (term)
print ('\n')
Now I have a dataframe with a column having new customer reviews and I want to predict that to which topic cluster they belong.
I have searched for answers, mostly the following way is recommended, as here Spark MLlib LDA, how to infer the topics distribution of a new unseen document?.
newDocuments: RDD[(Long, Vector)] = ...
topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)
However, I get the following error:
'LDAModel' object has no attribute 'toLocal'.
Neither do it have topicDistribution attribute.
So are these attributes not supported in spark 2.1.1?
So any other way to infer topics from the unseen data?
You're going to need to pre-process the new data:
# import a new data set to be passed through the pre-trained LDA
data_new = pd.read_csv('YourNew.csv', encoding = "ISO-8859-1");
data_new = data_new.dropna()
data_text_new = data_new[['Your Target Column']]
data_text_new['index'] = data_text_new.index
documents_new = data_text_new
#documents_new = documents.dropna(subset=['Preprocessed Document'])
# process the new data set through the lemmatization, and stopwork functions
processed_docs_new = documents_new['Preprocessed Document'].map(preprocess)
# create a dictionary of individual words and filter the dictionary
dictionary_new = gensim.corpora.Dictionary(processed_docs_new[:])
dictionary_new.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# define the bow_corpus
bow_corpus_new = [dictionary_new.doc2bow(doc) for doc in processed_docs_new]
Then you can just pass it through the trained LDA as a function. All you need is that bow_corpus:
ldamodel[bow_corpus_new[:len(bow_corpus_new)]]
If you want it out in a csv try this:
a = ldamodel[bow_corpus_new[:len(bow_corpus_new)]]
b = data_text_new
topic_0=[]
topic_1=[]
topic_2=[]
for i in a:
topic_0.append(i[0][1])
topic_1.append(i[1][1])
topic_2.append(i[2][1])
d = {'Your Target Column': b['Your Target Column'].tolist(),
'topic_0': topic_0,
'topic_1': topic_1,
'topic_2': topic_2}
df = pd.DataFrame(data=d)
df.to_csv("YourAllocated.csv", index=True, mode = 'a')
I hope this helps :)

Pyspark spark-ml metadata dictionary is empty

I am implementing a text classifier in pyspark as below
tokenizer = RegexTokenizer(inputCol="documents", outputCol="tokens", pattern='\\W+')
remover = StopWordsRemover(inputCol='tokens', outputCol='nostops')
vectorizer = CountVectorizer(inputCol='nostops', outputCol='features', vocabSize=1000)
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel", handleInvalid='skip')
labelIndexer_model = labelIndexer.fit(countModel_df)
convertor = IndexToString(inputCol='prediction', outputCol='predictedLabel', labels=labelIndexer_model.labels)
rfc = RandomForestClassifier(featuresCol='features', labelCol='indexedLabel', numTrees=30)
evaluator = BinaryClassificationEvaluator(labelCol='indexedLabel', rawPredictionCol='prediction')
pipe_rfc = Pipeline(stages=[tokenizer, remover, labelIndexer, vectorizer, rfc, convertor])
train_df, test_df = df.randomSplit((0.8, 0.2), seed=42)
model = pipe_rfc.fit(train_df)
prediction_rfc_df = rfc_model.transform(test_df)
The code is working and the prediction_rfc_df makes the predictions as expected. But when i want to check the metadata - the metadata dictionary is empty as below
prediction_rfc_df.schema['features'].metadata
Output : {}
prediction_rfc_df.schema['label'].metadata
Output: {}
Any ideas why the metadata is missing in the DataFrame ?
I am reading the Data from a Cassandra table as below:
df = spark.read \
.format("org.apache.spark.sql.cassandra") \
.options(table='table_name', keyspace='key_space_name') \
.load()

How to cross validate RandomForest model?

I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same or do I have to perform cross validation manually?
ML provides CrossValidator class which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed you can add cross-validation as follows:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// [label: double, features: vector]
trainingData org.apache.spark.sql.DataFrame = ???
val nFolds: Int = ???
val numTrees: Int = ???
val metric: String = ???
val rf = new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
.setNumTrees(numTrees)
val pipeline = new Pipeline().setStages(Array(rf))
val paramGrid = new ParamGridBuilder().build() // No parameter search
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
// "f1" (default), "weightedPrecision", "weightedRecall", "accuracy"
.setMetricName(metric)
val cv = new CrossValidator()
// ml.Pipeline with ml.classification.RandomForestClassifier
.setEstimator(pipeline)
// ml.evaluation.MulticlassClassificationEvaluator
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(nFolds)
val model = cv.fit(trainingData) // trainingData: DataFrame
Using PySpark:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
trainingData = ... # DataFrame[label: double, features: vector]
numFolds = ... # Integer
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
evaluator = MulticlassClassificationEvaluator() # + other params as in Scala
pipeline = Pipeline(stages=[rf])
paramGrid = (ParamGridBuilder.
.addGrid(rf.numTrees, [3, 10])
.addGrid(...) # Add other parameters
.build())
crossval = CrossValidator(
estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=evaluator,
numFolds=numFolds)
model = crossval.fit(trainingData)
To build on zero323's great answer using Random Forest Classifier, here is a similar example for Random Forest Regressor:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.regression.RandomForestRegressor // CHANGED
import org.apache.spark.ml.evaluation.RegressionEvaluator // CHANGED
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}
val numFolds = ??? // Integer
val data = ??? // DataFrame
// Training (80%) and test data (20%)
val Array(train, test) = data.randomSplit(Array(0.8,0.2))
val featuresCols = data.columns
val va = new VectorAssembler()
va.setInputCols(featuresCols)
va.setOutputCol("rawFeatures")
val vi = new VectorIndexer()
vi.setInputCol("rawFeatures")
vi.setOutputCol("features")
vi.setMaxCategories(5)
val regressor = new RandomForestRegressor()
regressor.setLabelCol("events")
val metric = "rmse"
val evaluator = new RegressionEvaluator()
.setLabelCol("events")
.setPredictionCol("prediction")
// "rmse" (default): root mean squared error
// "mse": mean squared error
// "r2": R2 metric
// "mae": mean absolute error
.setMetricName(metric)
val paramGrid = new ParamGridBuilder().build()
val cv = new CrossValidator()
.setEstimator(regressor)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(numFolds)
val model = cv.fit(train) // train: DataFrame
val predictions = model.transform(test)
predictions.show
val rmse = evaluator.evaluate(predictions)
println(rmse)
Evaluator metric source:
https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.evaluation.RegressionEvaluator

Resources