Doesn't every PySpark API call run as a distributed Spark job? - apache-spark

I used the code below (VectorAssembler and StandardScaler) for standardization.
But while VectorAssembler is running, no Spark job is shown and it is very slow.
I can't see how many tasks succeeded, the duration, and so on.
How can I make the Spark job show up while VectorAssembler is working?
Or is it impossible? If so, I'd like to know why.
Also, the VectorAssembler step below is very slow; how can I make it faster?
from pyspark.ml.feature import VectorAssembler, StandardScaler

def make_vector_assembler(df, input_cols, output_col):
    # Assemble the input columns into a single vector column.
    assembler = VectorAssembler().setInputCols(input_cols).setOutputCol(output_col)
    return assembler.transform(df)

def fit_standard_scaler(df, input_col, output_col, mean=True, std=True):
    # Fit a StandardScaler on the vector column and return the scaled DataFrame.
    scaler = StandardScaler(
        inputCol=input_col, outputCol=output_col, withMean=mean, withStd=std
    )
    scaler_model = scaler.fit(df)
    return scaler_model.transform(df)
(The Spark job shown is just an example, not my actual result.)
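One point worth noting: transform() is a lazy transformation, so a job normally only becomes visible in the Spark UI once an action runs on the result. A minimal sketch, assuming df and input_cols are already defined and using the helper above:

assembled = make_vector_assembler(df, input_cols, "features")
assembled.count()  # an action such as count() or show() forces execution and shows up as a job in the Spark UI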

Related

Error saving a linear regression model with MLLib

Trying to save my linear regression model to disk, I receive this error: "TypeError: save() takes 2 positional arguments but 3 were given"
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.regression import LinearRegression

sc = SparkContext()
lr = LinearRegression(featuresCol='features', labelCol='NextOrderInDays', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
lr_model.save(sc, "lr_model.model")  # this line raises the TypeError
Searching the web turns up code similar to what I wrote. What am I missing as the third argument?
Thanks
You use the ml package, not mllib: from pyspark.ml.regression import LinearRegression.
So the save function takes only one argument: the path (cf. the documentation).
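In other words, drop the SparkContext argument. A minimal sketch of the corrected call (path kept from the question):

lr_model.save("lr_model.model")
# or, to allow overwriting an existing path:
lr_model.write().overwrite().save("lr_model.model")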

How to convert an RDD of pandas DataFrames to a Spark DataFrame

I create an RDD of pandas DataFrames as an intermediate result. I want to convert it to a Spark DataFrame and eventually save it to a parquet file.
I want to know what the efficient way to do this is.
Thanks
def create_df(x):
    return pd.DataFrame(np.random.rand(5, 3)).assign(col=x)

sc.parallelize(range(5)).map(create_df) \
    .TO_DATAFRAME().write.format("parquet").save("parquet_file")  # TO_DATAFRAME() is a placeholder for the conversion I am looking for
I have tried pd.concat to reduce the RDD to one big pandas DataFrame, but that does not seem right.
Talking of efficiency: since Spark 2.3, Apache Arrow is integrated with Spark, and it transfers data efficiently between the JVM and Python processes, which speeds up the conversion from a pandas DataFrame to a Spark DataFrame. You can enable it with
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
If your Spark distribution doesn't have Arrow integrated, this setting will not throw an error; it will simply be ignored.
A sample to run in the pyspark shell could look like this:
import numpy as np
import pandas as pd
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = pd.DataFrame(np.random.rand(100, 3))
df = spark.createDataFrame(pdf)
df.write.format("parquet").save('data_parquet_file')
Your create_df method returns a pandas DataFrame, and you can create a Spark DataFrame from that directly - I am not sure why you need sc.parallelize(range(5)).map(create_df).
So your full code can look like this:
import pandas as pd
import numpy as np
def create_df(x):
    return pd.DataFrame(np.random.rand(5, 3)).assign(col=x)
pdf = create_df(10)
df = spark.createDataFrame(pdf)
df.write.format("parquet").save('data_parquet_file')
Or, if you want to keep the distributed map, you can flatten each pandas DataFrame into rows:
import pandas as pd
import numpy as np

def create_df(x):
    df = pd.DataFrame(np.random.rand(5, 3)).assign(col=x)
    return df.values.tolist()

sc.parallelize(range(5)).flatMap(create_df).toDF() \
    .write.format("parquet").save("parquet_file")
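To sanity-check either approach, you can read the saved parquet back:

spark.read.parquet("parquet_file").show()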

Spark matrix multiplication code takes a lot of time to execute

I have a simple PySpark environment set up using findspark.init() in Spyder, and I'm running the code on localhost. I am confused about how a simple matrix multiplication can take hours and hours using BlockMatrix in Spark, whereas the same code takes a few minutes with numpy.
Here's the code I'm using:
import numpy as np
import pandas as pd
from sklearn import cross_validation as cv
import itertools
import random
import findspark
import time
start=time.time()
findspark.init()
from pyspark.mllib.linalg.distributed import *
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('myapp')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
from pyspark.mllib.linalg.distributed import *
def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)

def prediction(P, Q):
    # np.r_[ pp,np.zeros(len(pp)) ].reshape(2,20)
    Pn = np.r_[P, np.zeros(len(P)), np.zeros(len(P)), np.zeros(len(P)), np.zeros(len(P))].reshape(5, len(P))
    Qn = np.r_[Q, np.zeros(len(Q)), np.zeros(len(Q)), np.zeros(len(Q)), np.zeros(len(Q))].reshape(5, len(Q))
    A = Pn[:1]
    B = Qn[:1].T
    distP = sc.parallelize(A)
    distQ = sc.parallelize(B)
    mat = as_block_matrix(distP).multiply(as_block_matrix(distQ))
    blocksRDD = mat.blocks
    m = (list(blocksRDD.collect())[0][1])
    # print(m)
    return m.toArray()[0, 0]

for epoch in range(1):
    for u, i in zip(users, items):
        e = R[u, i] - prediction(P[:, u], Q[:, i])
Not knowing the size of your matrices makes it more difficult to answer this question, but if you are working with high-dimensional sparse matrices, one possible issue is inherent to the way PySpark does matrix multiplication: in order to multiply sparse matrices, PySpark converts them to dense matrices. This is noted in the documentation:
http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix.multiply
which states that:
multiply(other) Left multiplies this BlockMatrix by other, another BlockMatrix. The colsPerBlock of this matrix must equal the rowsPerBlock of other. If other contains any SparseMatrix blocks, they will have to be converted to DenseMatrix blocks. The output BlockMatrix will only consist of DenseMatrix blocks. This may cause some performance issues until support for multiplying two sparse matrices is added.
As far as I know, there isn't a good workaround for this if you intend to use the built-in matrix data types. One fix is to abandon the matrix data types and hand-roll your own matrix multiplication using RDD or DataFrame join operations. For example, if you can use DataFrames, the following has been tested and works fairly well at scale:
from pyspark.sql.functions import sum

def multiply_df_matrices(A, B):
    # A and B hold sparse matrices in coordinate (COO) form, with columns (row, column, value).
    return A.join(B, A['column'] == B['row']) \
        .groupBy(A['row'], B['column']) \
        .agg(sum(A['value'] * B['value']).alias('value'))
You can do something similar by joining two rdds.
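A quick usage sketch (the (row, column, value) schema is an assumption inferred from the join above):

A = spark.createDataFrame([(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0)], ["row", "column", "value"])
B = spark.createDataFrame([(0, 0, 4.0), (1, 0, 5.0)], ["row", "column", "value"])
multiply_df_matrices(A, B).show()  # entry (0, 0) should be 1*4 + 2*5 = 14, entry (1, 0) should be 3*4 = 12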

PySpark, Decision Trees (Spark 2.0.0)

I am new to Spark (using PySpark). I tried running the Decision Tree tutorial from here (link). I execute the code:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.util import MLUtils
# Load and parse the data file, converting it to a DataFrame.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Now this line fails
featureIndexer = \
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
I get the error message:
IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT#3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT#f71b0bce.'
When searching the web for this error I found an answer that says:
use
from pyspark.ml.linalg import Vectors, VectorUDT
instead of
from pyspark.mllib.linalg import Vectors, VectorUDT
which is odd, since I haven't used it. Also, adding this import to my code solves nothing and I still get the same error.
I am not quite clear on how to debug this situation. When looking into the raw data I see:
data.show()
+--------------------+-----+
| features|label|
+--------------------+-----+
|(692,[127,128,129...| 0.0|
|(692,[158,159,160...| 1.0|
|(692,[124,125,126...| 1.0|
|(692,[152,153,154...| 1.0|
This looks like a list; it starts with '('.
I am not sure how to solve this issue, or even how to debug it.
The source of the problem seems to be running a Spark 1.5.2 example on Spark 2.0.0 (see the reference to the Spark 2.0 example below).
The difference between spark.ml and spark.mllib
As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.
More details can be found here: http://spark.apache.org/docs/latest/ml-guide.html
Since you are using Spark 2.0, please try the Spark 2.0.0 example (https://spark.apache.org/docs/2.0.0/mllib-decision-tree.html):
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
# Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(lambda vp: vp[0] != vp[1]).count() / float(testData.count())  # tuple unpacking in lambdas only works in Python 2
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())
# Save and load model
model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")
Find full example code at "examples/src/main/python/mllib/decision_tree_classification_example.py" in the Spark repo.
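Alternatively, if you want to stay with the DataFrame-based spark.ml API that your original code uses, a minimal sketch (assuming Spark 2.0+) is to load the libsvm file through spark.read, which produces ml-typed vectors instead of mllib ones and avoids the VectorUDT mismatch:

from pyspark.ml.feature import StringIndexer, VectorIndexer

# spark.read.format("libsvm") yields a DataFrame whose "features" column uses the ml VectorUDT,
# so the ml indexers accept it.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)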

Spark Streaming: How to load a Pipeline on a Stream?

I am implementing a lambda architecture system for stream processing.
I have no issue creating a Pipeline with GridSearch in Spark Batch:
pipeline = Pipeline(stages=[data1_indexer, data2_indexer, ..., assembler, logistic_regressor])
paramGrid = (
    ParamGridBuilder()
    .addGrid(logistic_regressor.regParam, (0.01, 0.1))
    .addGrid(logistic_regressor.tol, (1e-5, 1e-6))
    # ...etcetera
).build()
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=paramGrid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=4)
pipeline_cv = cv.fit(raw_train_df)
model_fitted = pipeline_cv.getEstimator().fit(raw_validation_df)
model_fitted.write().overwrite().save("pipeline")
However, I can't seem to find how to plug the pipeline into the Spark Streaming process. I am using Kafka as the DStream source, and my code as of now is as follows:
import json
from pyspark.ml import PipelineModel
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
ssc = StreamingContext(sc, 1)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer", {"kafka_topic": 1})
model = PipelineModel.load('pipeline/')
parsed_stream = kafkaStream.map(lambda x: json.loads(x[1]))
# CODE MISSING GOES HERE
ssc.start()
ssc.awaitTermination()
and now I need to find some way of filling in that missing code.
Based on the documentation here (even though it looks very, very outdated), it seems like your model needs to implement the method predict to be able to use it on an RDD object (and hopefully on a KafkaStream?).
How could I use the pipeline in the Streaming context? The reloaded PipelineModel only seems to implement transform.
Does that mean the only way to use batch models in a Streaming context is to use pure models, and no pipelines?
I found a way to load a Spark Pipeline into Spark Streaming.
This solution works for Spark v2.0; later versions will probably implement a better solution.
The solution I found transforms the streaming RDDs into DataFrames using the toDF() method, and you can then apply the pipeline.transform method to them.
This way of doing things is horribly inefficient though.
# we load the required libraries
from pyspark.sql.types import (
    StructType, StringType, StructField, LongType
)
from pyspark.sql import Row
from pyspark.ml import PipelineModel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# we specify the DataFrame's schema, so Spark does not have to do reflection on the data.
pipeline_schema = StructType(
    [
        StructField("field1", StringType(), True),
        StructField("field2", StringType(), True),
        StructField("field3", LongType(), True)
    ]
)

# we load the pipeline saved with Spark batch
pipeline = PipelineModel.load('/pipeline')

# set up the usual Spark context and Spark Streaming context
sc = spark.sparkContext
ssc = StreamingContext(sc, 1)

# in my case I use a Kafka directKafkaStream as the DStream source
directKafkaStream = KafkaUtils.createDirectStream(ssc, suwanpos[QUEUE_NAME], {"metadata.broker.list": "localhost:9092"})

def handler(req_rdd):
    def process_point(p):
        # here goes the logic to run after the pipeline has been applied
        print(p)
    if req_rdd.count() > 0:
        # Here is the gist of it: we turn the RDD into Rows, then into a df with the specified schema
        req_df = req_rdd.map(lambda r: Row(**r)).toDF(schema=pipeline_schema)
        # Now we can apply the transform, yaaay
        pred = pipeline.transform(req_df)
        records = pred.rdd.map(lambda p: process_point(p)).collect()
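One piece that is presumably still needed (an assumption on my part, since it is not shown above) is attaching the handler to the stream with foreachRDD before starting the streaming context:

directKafkaStream.foreachRDD(handler)
ssc.start()
ssc.awaitTermination()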
Hope this helps.
