Real-time inference on Spark Structured Streaming - apache-spark

I'm trying to run real-time inference on a Spark structured stream. First I trained the model:
# model creation
model.fit()
model.predict([33.26, 68.51, 1012.49, 52.68])

# create spark df from kafka stream
df = spark.readStream.format("kafka").....

# inference
def predict(input):
    # extract json from input and convert it to a list of doubles
    # result = model.predict(input_list)
    # return result

spark.udf.register("lr_predict", predict, StringType())
df3 = df2.withColumn('predict_response', predict(col('value')))
display(df3)
I'm not sure how to extract the JSON input from the Spark SQL DataFrame and run it through the model. I've been trying things since yesterday and nothing seems to stick.

import json

def predict(input):
    """I am creating the function with the following assumptions;
    please tell me if these are not correct:
    - the model outputs a string value
    - the model needs a numeric data type as input
    """
    # extract the json from the input and convert it to a list of doubles
    input_processed = json.loads(input)
    input_features_array = input_processed['input']
    # casting to float just to be sure
    input_features_array = [float(x) for x in input_features_array]
    # predicting the output
    result = model.predict(input_features_array)
    return result

## if the output is a string like "GOOD"/"BAD", then use StringType()
## if the output is numeric like 0.92 or 1028.384, then use DoubleType()
from pyspark.sql.types import StringType, DecimalType, FloatType, DoubleType
from pyspark.sql.functions import udf, col

lr_prediction = udf(predict, DoubleType())
df3 = df2.withColumn('predict_response_as_double', lr_prediction(col('value')))

lr_prediction = udf(predict, StringType())
df4 = df2.withColumn('predict_response_as_string', lr_prediction(col('value')))
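An alternative worth mentioning: if the Kafka value is a JSON string like {"input": [...]}, you can let Spark do the parsing with from_json before handing the numbers to the model. This is only a hedged sketch: the payload schema, the column names, and the way the model is called (scikit-learn style, expecting a 2-D array) are assumptions, not something taken from the post.

# Sketch only: payload schema, column names and the model call are assumptions.
from pyspark.sql.functions import from_json, col, udf
from pyspark.sql.types import StructType, StructField, ArrayType, DoubleType, StringType

# assumed payload shape: {"input": [33.26, 68.51, 1012.49, 52.68]}
payload_schema = StructType([
    StructField("input", ArrayType(DoubleType()))
])

# parse the Kafka value (bytes) into a struct column
parsed = df2.withColumn("payload", from_json(col("value").cast("string"), payload_schema))

# the UDF now receives a plain Python list of floats rather than a raw JSON string;
# model is assumed to be picklable (available in the UDF closure) and to take a 2-D array
predict_udf = udf(lambda features: str(model.predict([features])[0]), StringType())

df3 = parsed.withColumn("predict_response", predict_udf(col("payload.input")))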

Related

How to Convert Pyspark DF to fixedwidth and save

I have a requirement to scan a fixed-width file using a specific schema, and once this is done, the resulting DF with filters applied needs to be converted back to fixed width. How can we apply such a transformation before the file is saved to S3? Below is what I have done.
df = spark.read.text(dataset_path)
# Dataframe with the selection logic applied
df = df.select(
    df.value.substr(1, 10).alias('name'),
    df.value.substr(11, 20).alias('another_name'),
    df.value.substr(31, 60).alias('address')
)
df = df.filter(df.name.isin('some_name'))
# Here is the dataframe which I need to convert to fixed width before saving.
df.write.save('s3a://somebucket/somepath')
Is there a way to get this done in PySpark?
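One way this could be approached (a minimal sketch only, assuming the column widths 10/20/60 implied by the substr calls above and plain space padding) is to pad each column back to its fixed width with rpad, concatenate the pieces into a single column, and write the result as plain text:

from pyspark.sql.functions import concat, rpad, col

# pad each column back to its assumed original width and rebuild one line per record
fixed_df = df.select(
    concat(
        rpad(col('name'), 10, ' '),
        rpad(col('another_name'), 20, ' '),
        rpad(col('address'), 60, ' ')
    ).alias('value')
)

# Spark writes a directory of part files with one fixed-width line per record
fixed_df.write.mode('overwrite').text('s3a://somebucket/somepath')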

PySpark UDF not recognizing number of arguments

I have defined a Python function "DateTimeFormat" which takes three arguments:
A Spark DataFrame column containing date strings (String)
The input format of the column's values, like yyyy-mm-dd (String)
The output format, i.e. the format in which the input has to be returned, like yyyymmdd (String)
I have now registered this function as UDF in Pyspark.
udf_date_time = udf(DateTimeFormat,StringType())
I am trying to call this UDF in a DataFrame select, and it seems to work fine as long as the input and output formats are different, like below:
df.select(udf_date_time('entry_date',lit('mmddyyyy'),lit('yyyy-mm-dd')))
But it fails when the input format and output format are the same, with the following error:
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd')))
"DateTimeFormat" takes exactly 3 arguments. 2 given
But I'm clearly sending three arguments to the UDF
I have tried the above example on Python 2.7 and Spark 2.1
The function seems to work as expected in normal Python when input and output formats are the same
>>>DateTimeFormat('10152019','mmddyyyy','mmddyyyy')
'10152019'
>>>
But the code below gives an error when run in Spark:
import datetime
# Standard date/timestamp formatter
# Takes a string date, its format, and the output format as arguments
# Returns the formatted date as a string
def DateTimeFormat(col, in_frmt, out_frmt):
    date_formatter = {'yyyy':'%Y','mm':'%m','dd':'%d','HH':'%H','MM':'%M','SS':'%S'}
    for key, value in date_formatter.items():
        in_frmt = in_frmt.replace(key, value)
        out_frmt = out_frmt.replace(key, value)
    return datetime.datetime.strptime(col, in_frmt).strftime(out_frmt)
Calling UDF using the code below
from pyspark.sql.functions import udf,lit
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
# Create SPARK session
spark = SparkSession.builder.appName("DateChanger").enableHiveSupport().getOrCreate()
df = spark.read.format("csv").option("header", "true").load(file_path)
# Registering UDF
udf_date_time = udf(DateTimeFormat,StringType())
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd'))).show()
CSV file input (shown as a screenshot in the original post)
The expected result is that the command
df.select('exit_date',udf_date_time('exit_date',lit('yyyy-mm-dd'),lit('yyyy-mm-dd'))).show()
should NOT throw an error like
DateTimeFormat takes exactly 3 arguments but 2 given
I am not sure if there's a better way to do this, but you can try the following.
Here I have assumed that you want your dates in a particular format and have set a default for the output format (out_frmt='yyyy-mm-dd') in your DateTimeFormat function.
I have added a new function called udf_score to help with the conversion. That might interest you.
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf, lit

df = spark.createDataFrame([
    ["10-15-2019"],
    ["10-16-2019"],
    ["10-17-2019"],
], ['exit_date'])

import datetime

def DateTimeFormat(col, in_frmt, out_frmt='yyyy-mm-dd'):
    date_formatter = {'yyyy':'%Y','mm':'%m','dd':'%d','HH':'%H','MM':'%M','SS':'%S'}
    for key, value in date_formatter.items():
        in_frmt = in_frmt.replace(key, value)
        out_frmt = out_frmt.replace(key, value)
    return datetime.datetime.strptime(col, in_frmt).strftime(out_frmt)

def udf_score(in_frmt):
    return udf(lambda l: DateTimeFormat(l, in_frmt))

in_frmt = 'mm-dd-yyyy'
df.select('exit_date', udf_score(in_frmt)('exit_date').alias('new_dates')).show()
+----------+----------+
| exit_date| new_dates|
+----------+----------+
|10-15-2019|2019-10-15|
|10-16-2019|2019-10-16|
|10-17-2019|2019-10-17|
+----------+----------+
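Another option along the same lines (a sketch only, not tested against your data) is a small UDF factory that takes both formats as plain Python arguments, so no lit() columns are involved at all and the input and output formats can be identical:

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

# both formats are closed over as ordinary Python strings,
# so the UDF itself only receives the date column
def make_date_udf(in_frmt, out_frmt):
    return udf(lambda s: DateTimeFormat(s, in_frmt, out_frmt), StringType())

# same input and output format, no "2 given" error
df.select('exit_date',
          make_date_udf('mm-dd-yyyy', 'mm-dd-yyyy')('exit_date').alias('same_format')).show()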

Spark and categorical string variables

I'm trying to understand how spark.ml handles string categorical independent variables. I know that in Spark I have to convert strings to doubles using StringIndexer.
E.g., "a"/"b"/"c" => 0.0/1.0/2.0.
But what I would really like to avoid is then having to use OneHotEncoder on that column of doubles. This seems to make the pipeline unnecessarily messy, especially since Spark knows that the data is categorical. Hopefully the sample code below makes my question clearer.
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.classification.LogisticRegression

val df = sqlContext.createDataFrame(Seq(
  Tuple2(0.0, "a"), Tuple2(1.0, "b"), Tuple2(1.0, "c"), Tuple2(0.0, "c")
)).toDF("y", "x")

// index the string column "x"
val indexer = new StringIndexer().setInputCol("x").setOutputCol("xIdx").fit(df)
val indexed = indexer.transform(df)

// build a data frame of label, vectors
val assembler = (new VectorAssembler()).setInputCols(List("xIdx").toArray).setOutputCol("features")
val assembled = assembler.transform(indexed)

// build a logistic regression model and fit it
val logreg = (new LogisticRegression()).setFeaturesCol("features").setLabelCol("y")
val model = logreg.fit(assembled)
The logistic regression sees this as a model with only one independent variable.
model.coefficients
res1: org.apache.spark.mllib.linalg.Vector = [0.7667490491775728]
But the independent variable is categorical with three categories = ["a", "b", "c"]. I know I never did a one-of-k encoding, but the metadata of the data frame knows that the feature vector is nominal.
import org.apache.spark.ml.attribute.AttributeGroup
AttributeGroup.fromStructField(assembled.schema("features"))
res2: org.apache.spark.ml.attribute.AttributeGroup = {"ml_attr":{"attrs":
{"nominal":[{"vals":["c","a","b"],"idx":0,"name":"xIdx"}]},
"num_attrs":1}}
How do I pass this information to LogisticRegression? Is this not the whole point of keeping dataframe metadata? There does not seem to be a CategoricalFeaturesInfo in SparkML. Do I really need to do a one-of-k encoding for each categorical feature?
Maybe I am missing something, but this really looks like a job for RFormula (https://spark.apache.org/docs/latest/ml-features.html#rformula).
As the name suggests, it takes an "R-style" formula that describes how the feature vector is composed from the input data columns.
For each categorical input column (that is, each column of StringType), it adds a StringIndexer + OneHotEncoder to the final pipeline implementing the formula under the hood.
The output is a feature vector (of doubles) that can be used with any algorithm in the org.apache.spark.ml package, such as the one you are targeting.
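For what it's worth, here is a minimal PySpark sketch of the RFormula approach (the question is in Scala, but the same transformer exists in pyspark.ml.feature); the data mirrors the toy DataFrame from the question:

from pyspark.ml.feature import RFormula
from pyspark.ml.classification import LogisticRegression

df = spark.createDataFrame(
    [(0.0, "a"), (1.0, "b"), (1.0, "c"), (0.0, "c")], ["y", "x"])

# "y ~ x" string-indexes and one-hot encodes the categorical column x under the hood
formula = RFormula(formula="y ~ x", featuresCol="features", labelCol="label")
prepared = formula.fit(df).transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(prepared)
print(model.coefficients)  # one coefficient per dummy level rather than a single one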

Generate single json file for pyspark RDD

I am building a Python script in which I need to generate a JSON file from a JSON RDD.
Following is the code snippet for saving the JSON file:
jsonRDD.map(lambda x: json.loads(x)) \
    .coalesce(1, shuffle=True) \
    .saveAsTextFile('examples/src/main/resources/demo.json')
But I need to write the JSON data to a single file instead of having the data distributed across several partitions.
So please suggest an appropriate solution for it.
Without the use of additional libraries like pandas, you could save your RDD of several JSONs by reducing them to one big string of JSONs, each separated by a new line:
# perform your operation
# note that you do not need a lambda expression for json.loads
jsonRDD = jsonRDD.map(json.loads).coalesce(1, shuffle=True)
# map the jsons back to strings
jsonRDD = jsonRDD.map(json.dumps)
# reduce to one big string with one json on each line
json_string = jsonRDD.reduce(lambda x, y: x + "\n" + y)
# write the string to a file
with open("path/to/your.json", "w") as f:
    f.write(json_string)
I have had issues with PySpark saving JSON files once I have them in an RDD or DataFrame, so what I do is convert them to a pandas DataFrame and save them to a non-distributed directory.
import pandas
df1 = sqlContext.createDataFrame(yourRDD)
df2 = df1.toPandas()
df2.to_json(yourpath)
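If you are on the DataFrame API anyway, another common approach (a sketch, assuming jsonRDD is an RDD of JSON strings and spark is a SparkSession) is to coalesce to a single partition and use the JSON writer; note that Spark still writes a directory containing one part file rather than a bare file:

# convert the RDD of JSON strings to a DataFrame and write it as a single partition
df = spark.read.json(jsonRDD)
df.coalesce(1).write.mode("overwrite").json("examples/src/main/resources/demo_json")
# result: a directory demo_json/ containing a single part-*.json file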

Using DataFrame with MLlib

Let's say I have a DataFrame (that I read in from a csv on HDFS) and I want to train some algorithms on it via MLlib. How do I convert the rows into LabeledPoints or otherwise utilize MLlib on this dataset?
Assuming you're using Scala:
Let's say you obtain the DataFrame as follows:
val results : DataFrame = sqlContext.sql(...)
Step 1: call results.printSchema() -- this will show you not only the columns in the DataFrame and (this is important) their order, but also what Spark SQL thinks their types are. Once you see this output, things get a lot less mysterious.
Step 2: Get an RDD[Row] out of the DataFrame:
val rows: RDD[Row] = results.rdd
Step 3: Now it's just a matter of pulling whatever fields interest you out of the individual rows. For this you need to know the 0-based position of each field and its type, and luckily you obtained all that in Step 1 above. For example,
let's say you did a SELECT x, y, z, w FROM ... and printing the schema yielded
root
|-- x double (nullable = ...)
|-- y string (nullable = ...)
|-- z integer (nullable = ...)
|-- w binary (nullable = ...)
And let's say all you wanted to use x and z. You can pull them out into an RDD[(Double, Integer)] as follows:
rows.map(row => {
  // x has position 0 and type double
  // z has position 2 and type integer
  (row.getDouble(0), row.getInt(2))
})
From here you just use Core Spark to create the relevant MLlib objects. Things could get a little more complicated if your SQL returns columns of array type, in which case you'll have to call getList(...) for that column.
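As a rough PySpark counterpart to the steps above (a sketch only; the names results, x and z come from the example, and treating x as the label is purely an assumption for illustration), the Row fields can be wrapped in LabeledPoint objects for the RDD-based MLlib API:

from pyspark.mllib.regression import LabeledPoint

# treat x as the label and z as a single numeric feature (an assumption for illustration)
labeled = results.rdd.map(lambda row: LabeledPoint(row.x, [float(row.z)]))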
Assuming you're using Java (Spark version 1.6.2):
Here is a simple example of Java code using DataFrame for machine learning.
It loads a JSON file with the following structure,
[{"label":1,"att2":5.037089672359123,"att1":2.4100883023159456}, ... ]
splits the data into training and testing sets,
trains the model using the training data,
applies the model to the test data, and
stores the results.
Moreover, according to the official documentation, the "DataFrame-based API is primary API" for MLlib as of the current version 2.0.0, so you can find several examples using DataFrame.
The code:
SparkConf conf = new SparkConf().setAppName("MyApp").setMaster("local[2]");
SparkContext sc = new SparkContext(conf);
String path = "F:\\SparkApp\\test.json";
String outputPath = "F:\\SparkApp\\justTest";
System.setProperty("hadoop.home.dir", "C:\\winutils\\");
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
DataFrame df = sqlContext.read().json(path);
df.registerTempTable("tmp");
DataFrame newDF = df.sqlContext().sql("SELECT att1, att2, label FROM tmp");
DataFrame dataFixed = newDF.withColumn("label", newDF.col("label").cast("Double"));
VectorAssembler assembler = new VectorAssembler().setInputCols(new String[]{"att1", "att2"}).setOutputCol("features");
StringIndexer indexer = new StringIndexer().setInputCol("label").setOutputCol("labelIndexed");
// Split the data into training and test
DataFrame[] splits = dataFixed.randomSplit(new double[] {0.7, 0.3});
DataFrame trainingData = splits[0];
DataFrame testData = splits[1];
DecisionTreeClassifier dt = new DecisionTreeClassifier().setLabelCol("labelIndexed").setFeaturesCol("features");
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {assembler, indexer, dt});
// Train model
PipelineModel model = pipeline.fit(trainingData);
// Make predictions
DataFrame predictions = model.transform(testData);
predictions.rdd().coalesce(1,true,null).saveAsTextFile("justPlay.txt" +System.currentTimeMillis());
RDD-based MLlib is on its way to being deprecated, so you should use DataFrame-based MLlib instead.
Generally the input to these MLlib APIs is a DataFrame containing two columns - label and features. There are various ways to build this DataFrame - low-level APIs like org.apache.spark.mllib.linalg.{Vector, Vectors}, org.apache.spark.mllib.regression.LabeledPoint, org.apache.spark.mllib.linalg.{Matrix, Matrices}, etc. They all take numeric values for the features and label.
Words can be converted to vectors using org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}. This documentation explains more - https://spark.apache.org/docs/latest/mllib-data-types.html
Once the input DataFrame with label and features is created, instantiate the MLlib API and pass the DataFrame to the 'fit' function to get the model, then call the 'transform' or 'predict' function on the model to get the results.
Example -
the training file looks like -
<numeric label> <a string separated by spaces>
//Build word vectors
val trainingData = spark.read.parquet(<path to training file>)
val sampleDataDf = trainingData
  .map { r =>
    val s = r.getAs[String]("value").split(" ")
    val label = s.head.toDouble
    val feature = s.tail
    (label, feature)
  }.toDF("label", "feature_words")

val word2Vec = new Word2Vec()
  .setInputCol("feature_words")
  .setOutputCol("feature_vectors")
  .setMinCount(0)
  .setMaxIter(10)

//build the Word2Vec model
val model = word2Vec.fit(sampleDataDf)

//convert the training text data to vectors and labels
val wVectors = model.transform(sampleDataDf)

//train a LinearSVC model
val svmAlgorithm = new LinearSVC()
  .setFeaturesCol("feature_vectors")
  .setMaxIter(100)
  .setLabelCol("label")
  .setRegParam(0.01)
  .setThreshold(0.5)
  .fit(wVectors) //use the word vectors created above

//Predict new data, same format as the training data (containing words)
val predictionData = spark.read.parquet(<path to prediction file>)
val pDataDf = predictionData
  .map { r =>
    val s = r.getAs[String]("value").split(" ")
    val label = s.head.toDouble
    val feature = s.tail
    (label, feature)
  }.toDF("label", "feature_words")

val pVectors = model.transform(pDataDf)
val predictionResult = pVectors.map { r =>
  val s = r.getAs[Seq[String]]("feature_words")
  val v = r.getAs[Vector]("feature_vectors")
  val c = svmAlgorithm.predict(v) // predict using the trained SVM
  s"$c ${s.mkString(" ")}"
}
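To tie the fit/transform pattern described above back to PySpark, here is a hedged sketch assuming a DataFrame that already has "label" and "features" columns (training_df and test_df are placeholders, not from the original answer):

from pyspark.ml.classification import LinearSVC

# training_df and test_df are assumed to exist with "features" and "label" columns
svm = LinearSVC(featuresCol="features", labelCol="label", maxIter=100, regParam=0.01)
model = svm.fit(training_df)
predictions = model.transform(test_df)  # adds rawPrediction and prediction columns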
