Caching factor of MatrixFactorizationModel in PySpark - apache-spark

After loading a saved MatrixFactorizationModel I get the warnings:
MatrixFactorizationModelWrapper: Product factor does not have a partitioner. Prediction on individual records could be slow.
MatrixFactorizationModelWrapper: Product factor is not cached. Prediction could be slow.
and indeed the computation is slow and does not scale well.
How do I set a partitioner and cache the Product factor?
Adding code that demonstrates the problem:
from pyspark import SparkContext
import sys
sc = SparkContext("spark://hadoop-m:7077", "recommend")
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
model = MatrixFactorizationModel.load(sc, "model")
model.productFeatures.cache()
I get:
Traceback (most recent call last):
File "/home/me/recommend.py", line 7, in
model.productFeatures.cache()
AttributeError: 'function' object has no attribute 'cache'

Concerning the caching, as I wrote in the comment box, you can cache your RDD by doing the following:
rdd.cache() # for Scala, Java and Python
EDIT: The userFeatures and the productFeatures are both of type RDD[(Int, Array[Double])]. (Ref. official documentation)
To cache productFeatures, you can do the following:
model.productFeatures().cache()
This of course assumes the loaded model is called model.
Example :
r1 = (1, 1, 1.0)
r2 = (1, 2, 2.0)
r3 = (2, 1, 2.0)
ratings = sc.parallelize([r1, r2, r3])
from pyspark.mllib.recommendation import ALS
model = ALS.trainImplicit(ratings, 1, seed=10)
model.predict(2, 2)
feats = model.productFeatures()
type(feats)
>> MapPartitionsRDD[137] at mapPartitions at PythonMLLibAPI.scala:1074
feats.cache()
As for the warning concerning the partitioner: even if you partition your model, say by feature with .partitionBy() to balance it, the repartitioning itself would most likely be too expensive to pay off in performance.
There is a JIRA ticket (SPARK-8708) concerning this issue that should be resolved in the next release of Spark (1.5).
Nevertheless, if you want to learn more about partitioning algorithms, I invite you to read the discussion in ticket SPARK-3717, which discusses partitioning by features within the DecisionTree and RandomForest algorithms.
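For completeness, here is a hedged sketch of what that per-key repartitioning would look like on the product factor RDD. Note that partitionBy() returns a new RDD, so this speeds up your own lookups against that RDD but does not change the factors the model itself uses in predict(), which is exactly what SPARK-8708 is about. The partition count is illustrative.
# Sketch only: repartition and cache the product factors for manual lookups.
product_factors = (model.productFeatures()   # RDD of (productId, factor array) pairs
                   .partitionBy(100)         # hash-partition by product id (100 is a placeholder)
                   .cache())
product_factors.count()                      # materialize the cache
# With a partitioner in place, single-key lookups touch only one partition:
factors_for_product_1 = product_factors.lookup(1)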

Related

How to leave scikit-learn estimator results in a dask distributed system?

You can find a minimal working example below (taken directly from the dask-ml page; the only change is to Client() so that it works on a distributed system):
import numpy as np
from dask.distributed import Client
import joblib
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
# Don't forget to start the dask scheduler and connect worker(s) to it.
client = Client('localhost:8786')
digits = load_digits()
param_space = {
    'C': np.logspace(-6, 6, 13),
    'gamma': np.logspace(-8, 8, 17),
    'tol': np.logspace(-4, -1, 4),
    'class_weight': [None, 'balanced'],
}
model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=3, n_iter=50, verbose=10)
with joblib.parallel_backend('dask'):
    search.fit(digits.data, digits.target)
But this returns the result to the local machine. This is not exactly my code; in my code
I am using the scikit-learn TF-IDF vectorizer. After I use fit_transform(), it returns the fitted and transformed data (in sparse format) to my local machine. How can I leave the results inside the distributed system (cluster of machines)?
PS: I just came across from dask_ml.wrappers import ParallelPostFit. Maybe this is the solution?
The answer was in front of my eyes and I couldn't see it for 3 days of searching. ParallelPostFit is the answer. The only problem is that it doesn't support fit_transform(), but fit() and transform() work, and transform() returns a lazily evaluated dask array (which is what I was looking for). Be careful about this warning:
Warning
ParallelPostFit does not parallelize the training step. The underlying
estimator’s .fit method is called normally.
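For reference, here is a minimal sketch (not from the original answer) of how ParallelPostFit can wrap a transformer so that transform() stays lazy on the cluster. It uses StandardScaler instead of the TF-IDF vectorizer purely to keep the example numeric and self-contained; the wrapper usage is the same.
import dask.array as da
import numpy as np
from dask.distributed import Client
from dask_ml.wrappers import ParallelPostFit
from sklearn.preprocessing import StandardScaler
client = Client('localhost:8786')                # assumes a running scheduler, as above
# fit() is not parallelized: it calls the wrapped estimator's fit as usual.
X_train = np.random.RandomState(0).rand(1000, 20)
wrapped = ParallelPostFit(estimator=StandardScaler())
wrapped.fit(X_train)
# transform() on a dask array returns a lazy dask array that stays on the cluster
# until you compute or persist it explicitly.
X_big = da.random.random((1_000_000, 20), chunks=(100_000, 20))
X_scaled = wrapped.transform(X_big)              # lazy; nothing is pulled to the driver
X_scaled = X_scaled.persist()                    # keep the result on the workers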

Getting TypeError: 'DecisionTreeClassifier' object is not iterable in Spark ML lib

I am trying to implement a decision tree in Spark MLlib with the help of the Coursera course "Machine Learning for Big Data". I get the error below:
<class 'pyspark.ml.classification.DecisionTreeClassifier'>
Traceback (most recent call last):
File "C:/sparkcourse/Pycharmproject/Decisiontree.py", line 65, in <module>
model=modelpipeline.fit(traindata)
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\base.py", line 64, in fit
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\pipeline.py", line 93, in _fit
TypeError: 'DecisionTreeClassifier' object is not iterable
Here is the code
from pyspark.sql import SparkSession
from pyspark.sql import DataFrameNaFunctions
#pipeline is estimator or transformer
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import Binarizer
from pyspark.ml.feature import VectorAssembler,VectorIndexer,StringIndexer
spark=SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").enableHiveSupport().getOrCreate()
weatherdata=spark.read.csv("file:///SparkCourse/daily_weather.csv",header="true",inferSchema="true")
#print(weatherdata.columns)
#for input features we explicitly take the columns
featurescolumn=['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am', 'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am', 'rain_accumulation_9am', 'rain_duration_9am']
#print(featurescolumn)
weatherdata=weatherdata.drop("number")
#print(weatherdata.columns)
#missing value dealing
weatherdata=weatherdata.na.drop()
#print(weatherdata.count(),len(weatherdata.columns))
#create a categorical variable to denote whether humidity is not low (we deal here with the relative_humidity_3pm column). If the value is
#less than 25% the categorical value is 0, otherwise 1. Using Binarizer solves this
binarizer=Binarizer(threshold=24.99999,inputCol='relative_humidity_3pm',outputCol='low_humid')
#we transform whole weatherdata into Binarizer categorical value
binarizerDf=binarizer.transform(weatherdata)
#binarizerDf.select("relative_humidity_3pm",'low_humid').show(4)
#aggregating the features that will be used to make predictions into a single column
#The inputCols argument specifies our list of column names we defined earlier, and outputCol is the name of the new column. The second line creates a new DataFrame with the aggregated features in a column.
assembler=VectorAssembler(inputCols=featurescolumn,outputCol="features")
assembled=assembler.transform(binarizerDf)
#assembled.select("features").show(1)
#splitting train and test data by calling randomSplit
(traindata, testdata)=assembled.randomSplit([0.80,0.20],seed=1234)
#data counting
print(traindata.count(),testdata.count())
#create decision trees Model
#----------------------------------
#The labelCol argument is the column we are trying to predict, featuresCol specifies the aggregated features column, maxDepth is stopping criterion for tree induction based on maximum depth of tree
#minInstancesPerNode is stopping criterion for tree induction based on minimum number of samples in a node
#impurity is the impurity measure used to split nodes.
decisiontree=DecisionTreeClassifier(labelCol="label",featuresCol="features",maxDepth=5,minInstancesPerNode=20,impurity="gini")
print(type(decisiontree))
#creating model by training the decision tree, pipeline solve this
modelpipeline=Pipeline(stages=decisiontree)
model=modelpipeline.fit(traindata)
#predicting test data
predictions=model.transform(testdata)
#showing predictedvalue
prediction=predictions.select('prediction','label').show(5)
The course uses Spark 1.6 in a Cloudera VM, but I have integrated Spark 2.1.0 with PyCharm.
stages should be a sequence of PipelineStages (Transformers or Estimators), not a single Estimator. Replace:
Pipeline(stages=decisiontree)
with
Pipeline(stages=[decisiontree])
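Beyond the immediate fix, here is a hedged sketch (not part of the original answer) of folding the question's preprocessing steps into the same pipeline. It reuses binarizer and assembler from the question's code and assumes the label should be the binarized low_humid column rather than a non-existent "label" column.
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
# The classifier's labelCol must name an existing column; here it points at the
# column produced by the Binarizer stage.
decisiontree = DecisionTreeClassifier(labelCol="low_humid", featuresCol="features",
                                      maxDepth=5, minInstancesPerNode=20, impurity="gini")
modelpipeline = Pipeline(stages=[binarizer, assembler, decisiontree])
# The pipeline now runs binarization and assembly itself, so the split can be done
# on the cleaned weatherdata DataFrame directly.
(traindata, testdata) = weatherdata.randomSplit([0.80, 0.20], seed=1234)
model = modelpipeline.fit(traindata)
predictions = model.transform(testdata)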

FPGrowth: Input data is not cached pyspark

I am trying to run the following example code. Even though I have cached my data, I am getting the "Input data is not cached" warning. Because of this issue, I am not able to use the FP-growth algorithm for large datasets.
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import SparkSession
"""
An example demonstrating FPGrowth.
Run with:
bin/spark-submit examples/src/main/python/ml/fpgrowth_example.py
"""
if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("FPGrowthExample")\
        .getOrCreate()
    # $example on$
    df = spark.createDataFrame([
        (0, [1, 2, 5]),
        (1, [1, 2, 3, 5]),
        (2, [1, 2])
    ], ["id", "items"])
    df = df.cache()
    fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
    model = fpGrowth.fit(df)
    # Display frequent itemsets.
    model.freqItemsets.show()
    # Display generated association rules.
    model.associationRules.show()
    # transform examines the input items against all the association rules and summarizes
    # the consequents as the prediction.
    model.transform(df).show()
    spark.stop()
Why:
Because ml.fpm.FPGrowth converts the data to an RDD and runs mllib.fpm.FPGrowth on that RDD. The RDD is not cached, which causes the warning in the mllib code.
What can you do about it:
In your code, nothing. If you think this is a big issue (it shouldn't be), open a JIRA ticket and create a pull request.
Because of this issue, I am not able to use the FP-growth algorithm for large datasets.
It can cause unnecessary allocation and slowdown, but it shouldn't be limiting. If you experience failures, it is possible that the parameters require tuning.
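If memory pressure on large inputs is the real problem, the knobs worth trying are minSupport and numPartitions on pyspark.ml.fpm.FPGrowth. A hedged sketch; the values below are placeholders to be tuned for the actual dataset:
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="items",
                    minSupport=0.1,      # higher support prunes candidate itemsets aggressively
                    minConfidence=0.6,
                    numPartitions=200)   # defaults to the input's partition count when unset
model = fpGrowth.fit(df)                 # df as in the example above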

How to save Spark model as a file

I'm testing out the documentation code at https://spark.apache.org/docs/1.6.2/mllib-ensembles.html#random-forests. For some reason, myRandomForestClassificationModel was saved as a directory. How do I save it as a file? I'm new to Spark, so I'm not sure if I did anything wrong in the code.
from pyspark import SparkContext
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
sc = SparkContext(appName="rf")
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, '/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a RandomForest model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
# Note: Use larger numTrees in practice.
# Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=100, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification forest model:')
print(model.toDebugString())
# Save and load model
model.save(sc, "/rf/myRandomForestClassificationModel")
sameModel = RandomForestModel.load(sc, "/rf/myRandomForestClassificationModel")
Nothing is wrong with your code. It is correct that models are saved as a directory; specifically, there is a data and a metadata directory inside it. This makes sense as Spark is a distributed system. It's like when you save data back to HDFS or S3, which happens in parallel; the same is done with the model.
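If you really need a single file, for example to copy the model somewhere else, one option (not part of the original answer, and assuming the model was saved to a local path rather than HDFS) is to archive the saved directory afterwards:
import shutil
# Pack the saved model directory into one zip archive (local filesystem only).
shutil.make_archive("myRandomForestClassificationModel", "zip",
                    "/rf/myRandomForestClassificationModel")
# Later: unpack it and load with the usual API.
shutil.unpack_archive("myRandomForestClassificationModel.zip", "/rf/restored")
sameModel = RandomForestModel.load(sc, "/rf/restored")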

PySpark, Decision Trees (Spark 2.0.0)

I am new to Spark (using PySpark). I tried running the Decision Tree tutorial from here (link). I execute the code:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.util import MLUtils
# Load and parse the data file, converting it to a DataFrame.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Now this line fails
featureIndexer =\
VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
I get the error message:
IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT#f71b0bce.'
When searching the web for this error I found an answer that says:
use
from pyspark.ml.linalg import Vectors, VectorUDT
instead of
from pyspark.mllib.linalg import Vectors, VectorUDT
which is odd, since I haven't used it. Also, adding this import to my code solves nothing and I still get the same error.
I am not quite clear on how to debug this situation. When looking into the raw data I see:
data.show()
+--------------------+-----+
| features|label|
+--------------------+-----+
|(692,[127,128,129...| 0.0|
|(692,[158,159,160...| 1.0|
|(692,[124,125,126...| 1.0|
|(692,[152,153,154...| 1.0|
This looks like a list; it starts with '('.
I am not sure how to solve this issue, or even debug.
The source of the problem seems to be executing a Spark 1.5.2 example on Spark 2.0.0 (see the reference to the Spark 2.0 example below).
The difference between spark.ml and spark.mllib
As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.
More details can be found here: http://spark.apache.org/docs/latest/ml-guide.html
Using Spark 2.0, please try the Spark 2.0.0 example (https://spark.apache.org/docs/2.0.0/mllib-decision-tree.html):
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda (v, p): v != p).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())
# Save and load model
model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")
Find full example code at "examples/src/main/python/mllib/decision_tree_classification_example.py" in the Spark repo.
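Alternatively, if you want to keep the DataFrame-based spark.ml code from the question, the vector type mismatch can be resolved by converting the old mllib vector column. A sketch using MLUtils.convertVectorColumnsToML (available in Spark 2.0+):
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.mllib.util import MLUtils
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
# Convert the mllib-style "features" column to the new ml VectorUDT so that
# spark.ml transformers such as VectorIndexer accept it.
data = MLUtils.convertVectorColumnsToML(data, "features")
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                               maxCategories=4).fit(data)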
