How to use a scikit pickle model in spark structured streaming? [duplicate] - apache-spark

I'm trying to apply a scikit-learn model, loaded from a pickle, to every row of a structured streaming dataframe.
I've tried using pandas_udf (code version 1), and it gives me this error:
AttributeError: 'numpy.ndarray' object has no attribute 'isnull'
Code:
inputPath = "/FileStore/df_training/streaming_df_1_nh_nd/"
from pyspark.sql import functions as f
from pyspark.sql.types import *
data_schema = data_spark_ts.schema
import pandas as pd
from pyspark.sql.functions import col, pandas_udf, PandasUDFType # User Defined Functions for pandas DataFrames
from pyspark.sql.types import LongType
get_prediction = pandas_udf(lambda x: gb2.predict(x), IntegerType())
streamingInputDF = (
spark
.readStream
.schema(data_schema) # Set the schema of the CSV data
.option("maxFilesPerTrigger", 1) # Treat a sequence of files as a stream by picking one file at a time
.csv(inputPath)
.fillna(0)
.withColumn("prediction", get_prediction( f.struct([col(x) for x in data_spark.columns]) ))
)
display(streamingInputDF.select("prediction"))
I've also tried using a normal udf instead of the pandas_udf, and it gives me this error:
ValueError: Expected 2D array, got 1D array instead:
[.. ... .. ..]
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I don't know how to reshape my data.
The model I'm trying to apply is loaded this way:
#load the pickle
import pickle
gb2 = None
with open('pickle_modello_unico.p', 'rb') as fp:
    gb2 = pickle.load(fp)
And its specification is this:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=300,
n_iter_no_change=None, presort='auto', random_state=None,
subsample=1.0, tol=0.0001, validation_fraction=0.1,
verbose=0, warm_start=False)
Any help with solving this?

I solved the issue by returning a pd.Series from the pandas_udf.
Here is the working code:
inputPath = "/FileStore/df_training/streaming_df_1_nh_nd/"
from pyspark.sql import functions as f
from pyspark.sql.types import *
data_schema = data_spark_ts.schema
import pandas as pd
from pyspark.sql.functions import col, pandas_udf, PandasUDFType # User Defined Functions for pandas DataFrames
from pyspark.sql.types import LongType
get_prediction = pandas_udf(lambda x: pd.Series(gb2.predict(x)), StringType())
streamingInputDF = (
spark
.readStream
.schema(data_schema) # Set the schema of the CSV data
.option("maxFilesPerTrigger", 1) # Treat a sequence of files as a stream by picking one file at a time
.csv(inputPath)
.withColumn("prediction", get_prediction( f.struct([col(x) for x in data_spark.columns]) ))
)
display(streamingInputDF.select("prediction"))
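For reference, the 1D-vs-2D reshape error from the plain udf can also be avoided by turning each row into a single-sample 2D array before calling predict. A minimal sketch, assuming the same gb2 model and data_spark column list as above; the prediction_udf column name and the str() cast are illustrative:
import numpy as np
from pyspark.sql import functions as f
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Each row arrives as a Row of feature values; reshape(1, -1) turns it into
# the single-sample 2D array that scikit-learn's predict() expects.
predict_row = udf(lambda row: str(gb2.predict(np.array(row).reshape(1, -1))[0]), StringType())
scored_df = streamingInputDF.withColumn(
"prediction_udf",
predict_row(f.struct([f.col(x) for x in data_spark.columns])))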

Related

convert spark dataframe to aws glue dynamic frame

I tried converting my Spark dataframes to dynamic frames to output as glueparquet files, but I'm getting the error
'DataFrame' object has no attribute 'fromDF'
My code relies heavily on Spark dataframes. Is there a way to convert from a Spark dataframe to a dynamic frame so I can write out as glueparquet? If so, could you please provide an example and point out what I'm doing wrong below?
code:
# importing libraries
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
# updated 11/19/19 for error caused in error logging function
spark = glueContext.spark_session
from pyspark.sql import Window
from pyspark.sql.functions import col
from pyspark.sql.functions import first
from pyspark.sql.functions import date_format
from pyspark.sql.functions import lit,StringType
from pyspark.sql.types import *
from pyspark.sql.functions import substring, length, min,when,format_number,dayofmonth,hour,dayofyear,month,year,weekofyear,date_format,unix_timestamp
base_pth='s3://test/'
bckt_pth1=base_pth+'test_write/glueparquet/'
test_df=glueContext.create_dynamic_frame.from_catalog(
database='test_inventory',
table_name='inventory_tz_inventory').toDF()
test_df.fromDF(test_df, glueContext, "test_nest")
glueContext.write_dynamic_frame.from_options(frame = test_nest,
connection_type = "s3",
connection_options = {"path": bckt_pth1+'inventory'},
format = "glueparquet")
error:
'DataFrame' object has no attribute 'fromDF'
Traceback (most recent call last):
File "/mnt/yarn/usercache/livy/appcache/application_1574556353910_0001/container_1574556353910_0001_01_000001/pyspark.zip/pyspark/sql/dataframe.py", line 1300, in __getattr__
"'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'fromDF'
fromDF is a classmethod on DynamicFrame, not a method on DataFrame. Here's how you can convert a DataFrame to a DynamicFrame:
from awsglue.dynamicframe import DynamicFrame
test_nest = DynamicFrame.fromDF(test_df, glueContext, "test_nest")
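With the result assigned to test_nest, the write call from the question should then work unchanged. A minimal sketch, reusing the glueContext and bckt_pth1 defined in the question:
glueContext.write_dynamic_frame.from_options(frame = test_nest,
connection_type = "s3",
connection_options = {"path": bckt_pth1+'inventory'},
format = "glueparquet")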
Just to consolidate the answers for Scala users too, here's how to transform a Spark DataFrame to a DynamicFrame (the method fromDF doesn't exist in the Scala API of DynamicFrame):
import com.amazonaws.services.glue.DynamicFrame
val dynamicFrame = DynamicFrame(df, glueContext)
I hope it helps!

Appending column name to column value using Spark

I have data in a comma-separated file, which I have loaded into a Spark data frame:
The data looks like:
A B C
1 2 3
4 5 6
7 8 9
I want to transform the above data frame in Spark using pyspark into:
A B C
A_1 B_2 C_3
A_4 B_5 C_6
--------------
Then convert it to a list of lists using pyspark:
[[ A_1 , B_2 , C_3],[A_4 , B_5 , C_6]]
And then run the FP-Growth algorithm using pyspark on the above data set.
The code that I have tried is below:
from pyspark.sql.functions import col, size
from pyspark.sql.functions import *
import pyspark.sql.functions as func
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
from pyspark.sql.types import StringType
from pyspark import SQLContext
sqlContext = SQLContext(sc)
df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/tables/data.csv")
names=df.schema.names
Then I thought of doing something inside a for loop:
for name in names:
-----
------
After this I will be using fpgrowth:
df = spark.createDataFrame([
(0, ["A_1", "B_2", "C_3"]),
(1, ["A_4", "B_5", "C_6"])], ["id", "items"])
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
A number of concepts here for those who normally use Scala, showing how to do the same with pyspark. It's somewhat different but instructive, although how many will find it so is the big question. I certainly learnt a point on pyspark with zipWithIndex myself. Anyway.
The first part is to get the data into the desired format; there are probably too many imports, but I'm leaving them as is:
from functools import reduce
from pyspark.sql.functions import lower, col, lit, concat, split
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as f
source_df = spark.createDataFrame(
[
(1, 11, 111),
(2, 22, 222)
],
["colA", "colB", "colC"]
)
intermediate_df = (reduce(
lambda df, col_name: df.withColumn(col_name, concat(lit(col_name), lit("_"), col(col_name))),
source_df.columns,
source_df
) )
allCols = [x for x in intermediate_df.columns]
result_df = intermediate_df.select(f.concat_ws(',', *allCols).alias('CONCAT_COLS'))
result_df = result_df.select(split(col("CONCAT_COLS"), ",\s*").alias("ARRAY_COLS"))
# Add 0,1,2,3, ... with zipWithIndex; we add it at the back, but that does not matter, you can move it around.
# Get the new structure: the existing fields (one in this case, but done flexibly) plus the zipWithIndex value.
schema = StructType(result_df.schema.fields[:] + [StructField("index", LongType(), True)])
# Need this dict approach with pyspark, different to Scala.
rdd = result_df.rdd.zipWithIndex()
rdd1 = rdd.map(
lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],)
)
final_result_df = spark.createDataFrame(rdd1, schema)
final_result_df.show(truncate=False)
returns:
+---------------------------+-----+
|ARRAY_COLS |index|
+---------------------------+-----+
|[colA_1, colB_11, colC_111]|0 |
|[colA_2, colB_22, colC_222]|1 |
+---------------------------+-----+
The second part is the old zipWithIndex approach with pyspark, if you need 0, 1, 2, ... indices. Painful compared to Scala, and in general this is easier to solve in Scala.
Not sure on performance; it's not a foldLeft, which is interesting. I think it is OK actually.
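To close the loop with the original goal, the ARRAY_COLS column can be fed straight to FPGrowth, since each row already holds a list of tagged items. A minimal sketch, assuming the final_result_df from above and the same thresholds as in the question:
from pyspark.ml.fpm import FPGrowth
# ARRAY_COLS holds rows like [colA_1, colB_11, colC_111], which is the items
# format FPGrowth expects; the extra index column is simply ignored.
fp_growth = FPGrowth(itemsCol="ARRAY_COLS", minSupport=0.5, minConfidence=0.6)
fp_model = fp_growth.fit(final_result_df)
fp_model.freqItemsets.show(truncate=False)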

Adding a Vectors Column to a pyspark DataFrame

How do I add a Vectors.dense column to a pyspark dataframe?
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.linalg import DenseVector
py_df = pd.DataFrame.from_dict({"time": [59., 115., 156., 421.], "event": [1, 1, 1, 0]})
sc = SparkContext(master="local")
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(py_df)
sdf.withColumn("features", DenseVector(1))
Gives an error in file anaconda3/lib/python3.6/site-packages/pyspark/sql/dataframe.py, line 1848:
AssertionError: col should be Column
It doesn't like the DenseVector type as a column. Essentially, I have a pandas dataframe that I'd like to transform to a pyspark dataframe and add a column of the type Vectors.dense. Is there another way of doing this?
Constant Vectors cannot be added as a literal. You have to use a udf:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import DenseVector, VectorUDT
one = udf(lambda: DenseVector([1]), VectorUDT())
sdf.withColumn("features", one()).show()
But I am not sure why you need that at all. If you want to transform existing columns into Vectors, use the appropriate pyspark.ml tools, like VectorAssembler (see Encode and assemble multiple features in PySpark):
from pyspark.ml.feature import VectorAssembler
VectorAssembler(inputCols=["time"], outputCol="features").transform(sdf)
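For completeness, a quick way to inspect the assembled column; a minimal sketch assuming the sdf built from py_df in the question:
from pyspark.ml.feature import VectorAssembler
# Assemble the numeric "time" column into a vector column named "features"
# and show it alongside the original columns.
assembler = VectorAssembler(inputCols=["time"], outputCol="features")
assembler.transform(sdf).select("time", "event", "features").show()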

Pyspark: applying kmeans on different groups of a dataframe

Using Pyspark, I would like to apply k-means separately to groups of a dataframe and not to the whole dataframe at once. For the moment I use a for loop which iterates over each group, applies k-means and appends the result to another table. But having a lot of groups makes it time consuming. Could anyone help me, please?
Thanks a lot!
for customer in customer_list:
    temp_df = togroup.filter(col("customer_id") == customer)
    df = assembler.transform(temp_df)
    k = 1
    while k < 5 and mtrc < width:
        k += 1
        kmeans = KMeans(k=k, seed=5, maxIter=20, initSteps=5)
        model = kmeans.fit(df)
        mtrc = 1 - model.computeCost(df)/ttvar
    a = model.transform(df).select(cols)
    allcustomers = allcustomers.union(a)
I came up with a solution using pandas_udf. A pure Spark or Scala solution would be preferable but has yet to be offered.
Assume my data is
import pandas as pd
df_pd = pd.DataFrame([['cat1',10.],['cat1',20.],['cat1',11.],['cat1',21.],['cat1',22.],['cat1',9.],['cat2',101.],['cat2',201.],['cat2',111.],['cat2',214.],['cat2',224.],['cat2',99.]],columns=['cat','val'])
df_spark = spark.createDataFrame(df_pd)
First solve the problem in pandas:
import numpy as np
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0)
def skmean(kmeans, x):
    X = np.array(x)
    kmeans.fit(X)
    return(kmeans.predict(X))
You can apply skmean() to a pandas data frame (to make sure it works properly):
df_pd.groupby('cat').apply(lambda x:skmean(kmeans,x)).reset_index()
To apply the function to a pyspark data frame, we use pandas_udf. But first we define a schema for the output data frame:
from pyspark.sql.types import *
schema = StructType(
[StructField('cat',StringType(),True),
StructField('clusters',ArrayType(IntegerType()))])
Convert the function above to a pandas_udf:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def skmean_udf(df):
    result = pd.DataFrame(
        df.groupby('cat').apply(lambda x: skmean(kmeans, x)))
    result.reset_index(inplace=True, drop=False)
    return(result)
You can use the function as follows:
df_spark.groupby('cat').apply(skmean_udf).show()
I came up with a second solution which I think is slightly better than the last one. The idea is to use groupby() together with collect_list() and write a udf that takes a list as input and generates the clusters. Continuing with df_spark from the other solution, we write:
df_flat = df_spark.groupby('cat').agg(F.collect_list('val').alias('val_list'))
Now we write the udf function:
import numpy as np
import pyspark.sql.functions as F
from sklearn.cluster import KMeans
from pyspark.sql.types import *
def skmean(x):
    kmeans = KMeans(n_clusters=2, random_state=0)
    X = np.array(x).reshape(-1, 1)
    kmeans.fit(X)
    clusters = kmeans.predict(X).tolist()
    return(clusters)
clustering_udf = F.udf(lambda arr: skmean(arr), ArrayType(IntegerType()))
Then apply the udf to the flattened dataframe:
df = df_flat.withColumn('clusters', clustering_udf(F.col('val_list')))
Then you can use F.explode() to convert the list into one row per element.
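A minimal sketch of that last step, assuming the df with the cat and clusters columns from above:
import pyspark.sql.functions as F
# Explode the clusters array back into one row per cluster label; the order
# inside each array follows the order produced by collect_list above.
df.select('cat', F.explode('clusters').alias('cluster')).show()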

How to access element of a VectorUDT column in a Spark DataFrame?

I have a dataframe df with a VectorUDT column named features. How do I get an element of the column, say the first element?
I've tried doing the following
from pyspark.sql.functions import udf
first_elem_udf = udf(lambda row: row.values[0])
df.select(first_elem_udf(df.features)).show()
but I get a net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype) error. I get the same error if I do first_elem_udf = udf(lambda row: row.toArray()[0]) instead.
I also tried explode() but I get an error because it requires an array or map type.
This should be a common operation, I think.
Convert output to float:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf
def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None
ith = udf(ith_, DoubleType())
Example usage:
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.select(ith("features", lit(1))).show()
## +-----------------+
## |ith_(features, 1)|
## +-----------------+
## | 2.0|
## | 9.0|
## +-----------------+
Explanation:
Output values have to be reserialized to equivalent Java objects. If you want to access values (beware of SparseVectors), you should use the item method:
v.values.item(0)
which returns standard Python scalars. Similarly, if you want to access all values as a dense structure:
v.toArray().tolist()
If you prefer using spark.sql, you can use the following custom function 'to_array' to convert the vector to an array. Then you can manipulate it as an array.
from pyspark.sql.types import ArrayType, DoubleType
def to_array_(v):
    return v.toArray().tolist()
from pyspark.sql import SQLContext
sqlContext=SQLContext(spark.sparkContext, sparkSession=spark, jsqlContext=None)
sqlContext.udf.register("to_array",to_array_, ArrayType(DoubleType()))
example
from pyspark.ml.linalg import Vectors
df = sc.parallelize([
(1, Vectors.dense([1, 2, 3])),
(2, Vectors.sparse(3, [1], [9]))
]).toDF(["id", "features"])
df.createOrReplaceTempView("tb")
spark.sql("""select * , to_array(features)[1] Second from tb """).toPandas()
output
id features Second
0 1 [1.0, 2.0, 3.0] 2.0
1 2 (0.0, 9.0, 0.0) 9.0
I ran into the same problem with not being able to use explode(). One thing you can do is use VectorSlicer from the pyspark.ml.feature library. Like so:
from pyspark.ml.feature import VectorSlicer
from pyspark.ml.linalg import Vectors
from pyspark.sql.types import Row
slicer = VectorSlicer(inputCol="features", outputCol="features_one", indices=[0])
output = slicer.transform(df)
output.select("features", "features_one").show()
For anyone trying to split the probability column generated after training a PySpark ML model into usable columns: this does not use a UDF or numpy, and it will only work for binary classification. Here lr_pred is the dataframe which has the predictions from the logistic regression model.
from pyspark.sql.functions import split, regexp_replace
from pyspark.sql.types import DoubleType
prob_df1 = lr_pred.withColumn("probability", lr_pred["probability"].cast("String"))
prob_df = prob_df1.withColumn('probabilityre', split(regexp_replace("probability", "^\[|\]", ""), ",")[1].cast(DoubleType()))
Since Spark 3.0.0 this can be done without using a UDF.
from pyspark.ml.functions import vector_to_array
https://discuss.dizzycoding.com/how-to-split-vector-into-columns-using-pyspark/
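A minimal sketch of the Spark 3.0+ approach, assuming the id/features dataframe from the earlier examples:
from pyspark.ml.functions import vector_to_array
# Convert the vector column to a plain array<double> column, then index into it.
df.withColumn("first_elem", vector_to_array("features")[0]).show()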
Why is Vector[Double] used in the results? That's not a very nice data type.
