Error saving a linear regression model with MLLib - python-3.x

When trying to save my linear regression model to disk, I receive this error: "TypeError: save() takes 2 positional arguments but 3 were given"
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression
sc = SparkContext()
lr = LinearRegression(featuresCol='features', labelCol='NextOrderInDays', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
lr_model.save(sc, "lr_model.model")
Searching the web turns up code similar to what I wrote. What am I missing as the third argument?
Thanks

You are using the ml package, not mllib: from pyspark.ml.regression import LinearRegression.
So the save function takes only one argument: the path (cf. the documentation).
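The call from the question therefore becomes a single-argument save, and loading works the same way (a minimal sketch reusing the lr_model and path from the question):
# pyspark.ml models take only the path
lr_model.save("lr_model.model")

# reload with the matching model class
from pyspark.ml.regression import LinearRegressionModel
loaded_model = LinearRegressionModel.load("lr_model.model")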

Related

How do I convert an h2o4gpu KMeans object to an sklearn KMeans object?

I'm in a spot where I need to convert an h2o4gpu KMeans object to a plain sklearn object and then save it.
I thought maybe I could just do the following. I was expecting to be able to save sklearn_model and load it, but I get the error: AttributeError: 'KMeans' object has no attribute '_n_threads'
from h2o4gpu.solvers import KMeans as GPUKMeans
from sklearn.cluster import KMeans
...
gpu_model = GPUKMeans(n_clusters=num_clusters)
gpu_model.fit(embeddings)
sklearn_model = KMeans(n_clusters=num_clusters)
sklearn_model.cluster_centers_ = gpu_model.cluster_centers_
...
After digging into the source code, I found some code that does a similar thing:
from h2o4gpu.solvers import KMeans as GPUKMeans
from sklearn.cluster import KMeans
from sklearn.utils._openmp_helpers import _openmp_effective_n_threads
...
gpu_model = GPUKMeans(n_clusters=num_clusters)
gpu_model.fit(embeddings)
# build an sklearn KMeans and copy the fitted attributes across
kmeans_model = KMeans(n_clusters=num_clusters)
kmeans_model.cluster_centers_ = gpu_model.cluster_centers_
kmeans_model.labels_ = gpu_model.labels_
kmeans_model.inertia_ = gpu_model.inertia_
# newer sklearn versions also expect _n_threads on a fitted estimator
kmeans_model._n_threads = _openmp_effective_n_threads()
...
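With those attributes copied over, the converted estimator can be saved and loaded like any other sklearn model; a sketch using joblib (the file name is hypothetical):
import joblib

# persist the hand-built sklearn KMeans
joblib.dump(kmeans_model, "kmeans_model.joblib")

# reload it and predict as usual
restored = joblib.load("kmeans_model.joblib")
labels = restored.predict(embeddings)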

Doesn't all of the PySpark API run distributed?

I used the code below with VectorAssembler and StandardScaler for standardization.
But while VectorAssembler is working, no Spark job is shown and it is very slow.
I can't see how many tasks succeeded, the duration, and so on.
How can I make a Spark job show up while VectorAssembler is working?
Or is it impossible? If so, I wonder why.
The VectorAssembler step below is very slow; how can I make it faster?
from pyspark.ml.feature import VectorAssembler, StandardScaler

def make_vector_assemble(df, input_cols, outputcol):
    assembler = VectorAssembler().setInputCols(input_cols).setOutputCol(outputcol)
    return assembler.transform(df)

def fit_standard_scalr(df, inputcol, outputcol, mean=True, std=True):
    scaler = StandardScaler(
        inputCol=inputcol, outputCol=outputcol, withMean=mean, withStd=std
    )
    scaler_model = scaler.fit(df)
    return scaler_model.transform(df)
(this Spark job is just an example, not my result)
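For reference, a minimal sketch of how these helpers would be called; the dataframe df and the column names f1, f2, f3 are hypothetical, not from the question:
# assemble three hypothetical numeric columns into one vector column, then standardize it
assembled_df = make_vector_assemble(df, ["f1", "f2", "f3"], "features")
scaled_df = fit_standard_scalr(assembled_df, "features", "scaled_features")
scaled_df.show(truncate=False)  # show() is an action, so the plan actually executes here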

How do I standardize a test dataset using StandardScaler in PySpark?

I have train and test datasets as below:
x_train:
inputs
[2,5,10]
[4,6,12]
...
x_test:
inputs
[7,8,14]
[5,5,7]
...
The inputs column is a vector containing the model's features after applying the VectorAssembler class to 3 separate columns.
When I try to transform the test data using the StandardScaler as below, I get an error saying it doesn't have the transform method:
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")
scaledTrainDF = scaler.fit(x_train).transform(x_train)
scaledTestDF = scaler.transform(x_test)
I am told that I should fit the standard scaler on the training data only once and use those parameters to transform the test set, so it is not accurate to do:
scaledTestDF = scaler.fit(x_test).transform(x_test)
So how do I deal with the error mentioned above?
Here is the correct syntax for using the scaler: you need to call transform on the fitted model, not on the scaler itself.
from pyspark.ml.feature import StandardScaler
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")
scaler_model = scaler.fit(x_train)
scaledTrainDF = scaler_model.transform(x_train)
scaledTestDF = scaler_model.transform(x_test)
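If the raw columns are still available, an equivalent way to keep the fit-once/transform-twice rule explicit is a Pipeline; a sketch assuming hypothetical raw column names col1, col2, col3 and raw dataframes raw_train_df and raw_test_df:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler

assembler = VectorAssembler(inputCols=["col1", "col2", "col3"], outputCol="inputs")
scaler = StandardScaler(inputCol="inputs", outputCol="scaled_features")

# fit on the training data only; the fitted PipelineModel carries the scaler statistics
pipeline_model = Pipeline(stages=[assembler, scaler]).fit(raw_train_df)
scaledTrainDF = pipeline_model.transform(raw_train_df)
scaledTestDF = pipeline_model.transform(raw_test_df)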

Deploy Keras model on Spark

I have a trained keras model.
https://github.com/qubvel/efficientnet
I have a large, regularly updated dataset that I want to get predictions on, meaning I would run my Spark job every 2 hours or so.
What is the way to implement this? MLlib does not support EfficientNet.
When searching online I saw this kind of implementation using sparkdl, but it does not support EfficientNet as the modelName parameter.
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName="InceptionV3")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
My naive approach would be
import efficientnet.keras as efn
model = efn.EfficientNetB0(weights='imagenet')
from sparkdl import readImages
from pyspark.sql.functions import col

image_df = readImages("flower_photos/sample/")
image_df = image_df.withColumn("modelTags", efficient_net_udf(col("image.data")))
and creating a UDF that calls model.predict...
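A rough sketch of that UDF idea, written as a pandas UDF; everything here (the image shape, per-batch model loading, the assumption that the bytes have already been decoded) is illustrative, not a tested recipe:
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType

@pandas_udf(ArrayType(FloatType()))
def efficient_net_udf(image_data: pd.Series) -> pd.Series:
    # the model is loaded inside the UDF so it lives on the executor; note that
    # this reloads it for every batch, so broadcasting the weights would be
    # preferable in practice
    import efficientnet.keras as efn
    model = efn.EfficientNetB0(weights='imagenet')

    def predict_one(raw):
        # assumes raw is already a decoded (224, 224, 3) float array;
        # real decoding/resizing/preprocessing of the image bytes is omitted here
        batch = np.expand_dims(np.asarray(raw, dtype=np.float32), axis=0)
        return model.predict(batch)[0].tolist()

    return image_data.apply(predict_one)

# once the image bytes are decoded into arrays, the UDF is applied with
# withColumn, as in the snippet above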
Another method I saw is
from keras.preprocessing.image import img_to_array, load_img
import numpy as np
import os
from pyspark.sql.types import StringType
from sparkdl import KerasImageFileTransformer
import efficientnet.keras as efn
model = efn.EfficientNetB0(weights='imagenet')
model.save("kerasModel.h5")
def loadAndPreprocessKeras(uri):
    image = img_to_array(load_img(uri, target_size=(299, 299)))
    image = np.expand_dims(image, axis=0)
    return image

transformer = KerasImageFileTransformer(inputCol="uri", outputCol="predictions",
                                        modelFile='path/kerasModel.h5',
                                        imageLoader=loadAndPreprocessKeras,
                                        outputMode="vector")
files = [os.path.abspath(os.path.join(dirpath, f)) for f in os.listdir("/data/myimages") if f.endswith('.jpg')]
uri_df = sqlContext.createDataFrame(files, StringType()).toDF("uri")
keras_pred_df = transformer.transform(uri_df)
What is the correct (and working) way to approach this?

'RDD' object has no attribute '_jdf' pyspark RDD

I'm new to PySpark. I would like to perform some machine learning on a text file.
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # convert the dataframe to an rdd
tr_data = td.map(lambda line: line.split()).map(lambda words: Row(label=words[0], words=words[1:]))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
and on my last command, I obtain the error:
AttributeError: 'RDD' object has no attribute '_jdf'
Can anyone help me, please?
Thank you
You shouldn't be using an RDD with CountVectorizer. Instead, form the array of words in the dataframe itself:
train_data = spark.read.text("20ng-train-all-terms.txt")
from pyspark.sql import functions as F
td = train_data.select(F.split("value", " ").alias("words")).select(F.col("words")[0].alias("label"), F.col("words"))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
And then it works, so you can call the transform function as
vectorizer_transformer.transform(td).show(truncate=False)
Now, if you want to stick with the old style of converting to an RDD, then you have to modify certain lines of code. The following is the modified, complete (working) version of your code:
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # convert the dataframe to an rdd
tr_data = td.map(lambda line: line[0].split(" ")).map(lambda words: Row(label=words[0], words=words[1:])).toDF()
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_data)
But I would suggest you stick with the dataframe way.
