Spark ML: Taking square root of feature columns

Hi, I am using a custom UDF to take the square root of each value in each column.
import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
square_root_UDF = udf(lambda x: math.sqrt(x), DoubleType())
for x in features:
    dataTraining = dataTraining.withColumn(x, square_root_UDF(x))
Is there a faster way to get this done? The polynomial expansion function is not suitable in this case.

Don't use a UDF. Use the built-in sqrt function instead:
from pyspark.sql.functions import sqrt
for x in features:
    dataTraining = dataTraining.withColumn(x, sqrt(x))

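To add the sqrt result as a new column in Scala, do the following:
// hc is assumed to be the existing HiveContext/SQLContext instance
import hc.implicits._
import org.apache.spark.sql.functions.sqrt
// bind the result to a new name; a val cannot be defined in terms of itself
val dataTrainingWithStd = dataTraining.withColumn("x_std", sqrt('x_variance))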

To speed up your calculation in this case:
- put your data into a DataFrame (not an RDD)
- use vectorized operations (not lambda operations wrapped in a UDF), as suggested by @user7757642
Here is an example for the case where your dataTraining is an RDD:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sqrt
spark = SparkSession.builder.appName("SessionName") \
    .config("spark.some.config.option", "some_value") \
    .getOrCreate()
df = spark.createDataFrame(dataTraining)
for x in features:
    df = df.withColumn(x, sqrt(x))

Related

Appending column name to column value using Spark

I have data in a comma-separated file, which I have loaded into a Spark data frame.
The data looks like:
A B C
1 2 3
4 5 6
7 8 9
I want to transform the above data frame in spark using pyspark as:
A B C
A_1 B_2 C_3
A_4 B_5 C_6
--------------
Then convert it to list of list using pyspark as:
[[ A_1 , B_2 , C_3],[A_4 , B_5 , C_6]]
And then run FP Growth algorithm using pyspark on the above data set.
The code that I have tried is below:
from pyspark.sql.functions import col, size
from pyspark.sql.functions import *
import pyspark.sql.functions as func
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
from pyspark import SQLContext
sqlContext = SQLContext(sc)
df = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/tables/data.csv")
names=df.schema.names
Then I thought of doing something inside a for loop:
for name in names:
    -----
    ------
After this I will be using FPGrowth:
df = spark.createDataFrame([
    (0, ["A_1", "B_2", "C_3"]),
    (1, ["A_4", "B_5", "C_6"]),
], ["id", "items"])
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)
There are a number of concepts here for those who normally use Scala, so this shows how to do it with pyspark. It is somewhat different but instructive, although for how many people is the big question. I certainly learnt something about zipWithIndex in pyspark myself.
The first part is to get the data into the desired format. There are probably too many imports, but I am leaving them as is:
from functools import reduce
from pyspark.sql.functions import lower, col, lit, concat, split
from pyspark.sql.types import *
from pyspark.sql import Row
from pyspark.sql import functions as f
source_df = spark.createDataFrame(
    [
        (1, 11, 111),
        (2, 22, 222)
    ],
    ["colA", "colB", "colC"]
)
intermediate_df = reduce(
    lambda df, col_name: df.withColumn(col_name, concat(lit(col_name), lit("_"), col(col_name))),
    source_df.columns,
    source_df
)
allCols = [x for x in intermediate_df.columns]
result_df = intermediate_df.select(f.concat_ws(',', *allCols).alias('CONCAT_COLS'))
result_df = result_df.select(split(col("CONCAT_COLS"), r",\s*").alias("ARRAY_COLS"))
# Add an index 0, 1, 2, ... with zipWithIndex. It is appended as the last column, but it can be moved around.
# Build the new schema: the existing fields (one in this case, but handled generically) plus the zipWithIndex value.
schema = StructType(result_df.schema.fields[:] + [StructField("index", LongType(), True)])
# The asDict approach is needed with pyspark; this differs from Scala.
rdd = result_df.rdd.zipWithIndex()
rdd1 = rdd.map(
    lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],)
)
final_result_df = spark.createDataFrame(rdd1, schema)
final_result_df.show(truncate=False)
returns:
+---------------------------+-----+
|ARRAY_COLS |index|
+---------------------------+-----+
|[colA_1, colB_11, colC_111]|0 |
|[colA_2, colB_22, colC_222]|1 |
+---------------------------+-----+
The second part is the old zipWithIndex approach in pyspark, in case you need 0, 1, ... as an index. It is painful compared to Scala.
In general this is easier to solve in Scala.
I am not sure about performance, as this is not a foldLeft, which is interesting. I think it is OK actually.
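For a file small enough to handle on the driver, the target list-of-lists format from the question can also be sketched in plain Python, which makes the expected shape of the FP-Growth input easy to check. This is only an illustration under assumptions: the inline CSV text and the helper name prefix_with_column_names are hypothetical, and it does not replace the DataFrame approach above for large data.

```python
import csv
import io

def prefix_with_column_names(csv_text):
    """Turn each CSV row into a list of "<column>_<value>" items."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # DictReader keeps the header order, so items come out in column order.
    return [[f"{name}_{row[name]}" for name in reader.fieldnames] for row in reader]

# Hypothetical inline data mirroring the sample in the question.
sample = "A,B,C\n1,2,3\n4,5,6\n7,8,9\n"
transactions = prefix_with_column_names(sample)
print(transactions)
```

Each inner list is one transaction for FP-Growth, and the column prefix keeps equal values from different columns distinguishable as items.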

kd tree implementation in PySpark

I am trying to build a kd tree using pyspark. For this, I am using a UDF to recursively build the kd tree from a 2-dimensional list of floats.
Following is the piece of code I am trying:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import *
spark = SparkSession.builder.appName("SRDD").getOrCreate()
sc = spark.sparkContext
# Some sequence of floats
abc = [[0.0769,0.2982],[0.0863,0.30052],[0.0690,0.33337],[0.11975,0.2984],[0.07224,0.3467],[0.1316,0.2999]]
def build_kdtree(points,depth=0):
    n=points.count()
    if n<=0:
        return None
    axis=depth%2
    sorted_points=sorted(points,key=lambda point:point[axis])
    return{
        'point': sorted_points[n/2],
        'left':build_kdtree(sorted_points[:n/2],depth+1),
        'right':build_kdtree(sorted_points[n/2 + 1:],depth+1)
    }
#This is how I'm trying to specify the return type of the function
kdtree_schema=StructType([StructField('point',ArrayType(FloatType()),nullable=True),StructField('left',StructType(),nullable=True),StructField('right',StructType(),nullable=True)])
kdtree_schema=StructType([StructField('point',ArrayType(FloatType()),nullable=True),StructField('left',kdtree_schema,nullable=True),StructField('right',kdtree_schema,nullable=True)])
#UDF registration
buildkdtree_udf=udf(build_kdtree, kdtree_schema)
#Function call
pointskdtree=buildkdtree_udf(abc)
However, this throws TypeError: Invalid argument, not a string or column.
I have 2 main questions:
1. Is my approach of building the kd tree recursively in Spark correct?
2. Are the lines where I specify the return type of the UDF as kdtree_schema correct?
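The TypeError itself comes from the call site: a function wrapped with udf has to be called with column names or Column objects, and abc is a plain Python list. Separately, the recursion has two problems that show up even without Spark: list.count() requires an argument, so len(points) is needed, and n/2 is a float in Python 3 and cannot be used as an index. A minimal plain-Python sketch of the recursive build, assuming Python 3 and a point list small enough to hold in memory (this is not a distributed implementation):

```python
def build_kdtree(points, depth=0):
    """Recursively build a 2-d kd tree as nested dicts from a list of [x, y] points."""
    n = len(points)
    if n == 0:
        return None
    axis = depth % 2  # alternate between x (0) and y (1) at each level
    sorted_points = sorted(points, key=lambda point: point[axis])
    mid = n // 2  # integer division so the result is a valid index
    return {
        'point': sorted_points[mid],
        'left': build_kdtree(sorted_points[:mid], depth + 1),
        'right': build_kdtree(sorted_points[mid + 1:], depth + 1),
    }

abc = [[0.0769, 0.2982], [0.0863, 0.30052], [0.0690, 0.33337],
       [0.11975, 0.2984], [0.07224, 0.3467], [0.1316, 0.2999]]
tree = build_kdtree(abc)
print(tree['point'])  # the median point along x becomes the root
```

On the second question, Spark SQL struct types cannot refer to themselves, so the two kdtree_schema assignments describe a tree of fixed depth rather than a recursive type. Returning the tree in a serialized form (for example as a JSON string) may be a workaround, though that is an assumption and not something tested here.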

Pyspark: applying kmeans on different groups of a dataframe

Using Pyspark I would like to apply kmeans separately to each group of a dataframe rather than to the whole dataframe at once. For the moment I use a for loop which iterates over each group, applies kmeans and appends the result to another table. But with a lot of groups this is time consuming. Could anyone help me, please?
Thanks a lot!
for customer in customer_list:
    temp_df = togroup.filter(col("customer_id")==customer)
    df = assembler.transform(temp_df)
    k = 1
    while (k < 5 and mtric < width):
        k += 1
        kmeans = KMeans(k=k,seed=5,maxIter=20,initSteps=5)
        model = kmeans.fit(df)
        mtric = 1 - model.computeCost(df)/ttvar
    a = model.transform(df).select(cols)
    allcustomers = allcustomers.union(a)
I came up with a solution using pandas_udf. A pure spark or scala solution is preferred and yet to be offered.
Assume my data is
import pandas as pd
df_pd = pd.DataFrame([['cat1',10.],['cat1',20.],['cat1',11.],['cat1',21.],['cat1',22.],['cat1',9.],['cat2',101.],['cat2',201.],['cat2',111.],['cat2',214.],['cat2',224.],['cat2',99.]],columns=['cat','val'])
df_spark = spark.createDataFrame(df_pd)
First solve the problem in pandas:
import numpy as np
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2,random_state=0)
def skmean(kmeans,x):
    # cluster on the numeric column only; the group frame also carries 'cat'
    X = np.array(x[['val']])
    kmeans.fit(X)
    return(kmeans.predict(X))
You can apply skmean() to a pandas data frame (to make sure it works properly):
df_pd.groupby('cat').apply(lambda x:skmean(kmeans,x)).reset_index()
To apply the function to pyspark data frame, we use pandas_udf. But first define a schema for the output data frame:
from pyspark.sql.types import *
schema = StructType(
    [StructField('cat',StringType(),True),
     StructField('clusters',ArrayType(IntegerType()))])
Convert the function above to a pandas_udf:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def skmean_udf(df):
    result = pd.DataFrame(
        df.groupby('cat').apply(lambda x: skmean(kmeans,x)))
    result.reset_index(inplace=True, drop=False)
    return(result)
You can use the function as follows:
df_spark.groupby('cat').apply(skmean_udf).show()
I came up with a second solution which I think is slightly better than the last one. The idea is to use groupby() together with collect_list() and write a udf that takes a list as input and generates the clusters. Continuing with df_spark from the other solution, we write:
df_flat = df_spark.groupby('cat').agg(F.collect_list('val').alias('val_list'))
Now we write the udf function:
import numpy as np
import pyspark.sql.functions as F
from sklearn.cluster import KMeans
from pyspark.sql.types import *
def skmean(x):
    kmeans = KMeans(n_clusters=2, random_state=0)
    X = np.array(x).reshape(-1,1)
    kmeans.fit(X)
    clusters = kmeans.predict(X).tolist()
    return(clusters)
clustering_udf = F.udf(lambda arr : skmean(arr), ArrayType(IntegerType()))
Then apply the udf to the flattened dataframe:
df = df_flat.withColumn('clusters', clustering_udf(F.col('val_list')))
Then you can use F.explode() to convert the list to a column.
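The data flow of this second solution (group, collect the values into a list, cluster the list, explode back to rows) can be traced in plain Python. In the sketch below a tiny hand-rolled 1-D two-means stands in for sklearn's KMeans, and the names two_means_1d and cluster_by_group are hypothetical. It only illustrates the shape of the intermediate results, not how Spark executes them.

```python
from collections import defaultdict

def two_means_1d(values, max_iter=20):
    """Label each value 0 or 1 with a minimal 1-D k-means (k=2).

    Stand-in for sklearn's KMeans; centers start at the min and max value.
    """
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        # assignment step: nearest center wins, ties go to cluster 0
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1 for v in values]
        new_centers = []
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            # keep the old center if a cluster ends up empty
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def cluster_by_group(rows):
    """rows: iterable of (cat, val). Returns exploded (cat, val, cluster) rows."""
    grouped = defaultdict(list)              # plays the role of groupby + collect_list
    for cat, val in rows:
        grouped[cat].append(val)
    exploded = []
    for cat, vals in grouped.items():        # plays the role of the per-group udf
        for val, label in zip(vals, two_means_1d(vals)):
            exploded.append((cat, val, label))  # plays the role of explode
    return exploded

rows = [('cat1', 10.), ('cat1', 20.), ('cat1', 11.), ('cat1', 21.), ('cat1', 22.), ('cat1', 9.),
        ('cat2', 101.), ('cat2', 201.), ('cat2', 111.), ('cat2', 214.), ('cat2', 224.), ('cat2', 99.)]
result = cluster_by_group(rows)
for row in result:
    print(row)
```

Pairing each value with its label before exploding keeps the cluster assignment attached to the value it was computed for.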

Pyspark freezing after an operation on aggregated Data Frame

I am using Spark 1.5.2 with Python 2.7.5.
I have this code that I run in the pyspark repl:
from pyspark.sql import SQLContext
ctx = SQLContext(sc)
df = ctx.createDataFrame([("a",1),("a",1),("a",0),("a",0),("b",1),("b",0),("b",1)],["group","conversion"])
from pyspark.sql.functions import col, count, avg
funs = [(count,"total"),(avg,"cr")]
aggregate = ["conversion"]
exprs = [f(col(c)).alias(name) for f,name in funs for c in aggregate]
df3 = df.groupBy("group").agg(*exprs).cache()
So far the code works fine and I can check df3:
>>> df3.collect()
[Row(group=u'a', total=4, cr=0.5), Row(group=u'b', total=3, cr=0.6666666666666666)]
However, when I try:
df3.agg(sum(col('cr'))).first()[0]
PySpark can't calculate that sum. However df3.rdd.reduce(lambda x,y: x[2]+y[2]) works just fine.
So, what is the issue with the first command to calculate the sum?
You should import pyspark's sum function first: from pyspark.sql.functions import sum. Otherwise Python's built-in sum is called, which expects an iterable of numbers rather than a Column.
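In a long REPL session it is easy to lose track of which function the bare name sum refers to. Below is a small check that assumes only the standard library (the helper name is_builtin_sum and the stand-in function are hypothetical). Importing the module under an alias, for example import pyspark.sql.functions as F and then calling F.sum, avoids the name clash altogether.

```python
import builtins

def is_builtin_sum(namespace):
    """Return True if the name `sum` in `namespace` still refers to Python's built-in."""
    return namespace.get("sum", builtins.sum) is builtins.sum

# Check the current session's global names.
print(is_builtin_sum(globals()))

# After `from pyspark.sql.functions import sum` the name would point at
# Spark's aggregate function instead; a stand-in function simulates that here.
def spark_like_sum(column):
    return ("sum", column)

print(is_builtin_sum({"sum": spark_like_sum}))
```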

pyspark expected zero arguments for construction of ClassDict (for pyspark.mllib.linalg.DenseVector)

I get the error
expected zero arguments for construction of ClassDict (for pyspark.mllib.linalg.DenseVector)
by trying this:
I have a function which I convert to a udf for transforming values of a column from a dataframe. Like this:
def func(vector):
    #does something
    return Vectors.dense(vector)
udfunc = udf(func, ArrayType(FloatType()))
new_df = df.withColumn("vector",udfunc(df.vector))
new_df.show()
The column df.vector has DenseVector values.
Does anybody have an idea how to fix this problem, or a hint?
Thanks in advance
Given the part of the code you provided, the obvious issue is that you declare an incorrect return type. The Catalyst type of a Vector is VectorUDT, not ArrayType(FloatType()):
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql.types import ArrayType, FloatType
from pyspark.sql.functions import udf
dummy_udf = udf(lambda _: Vectors.dense([0, 0, 0]), VectorUDT())
sc.parallelize([(Vectors.dense([1, 1, 1]), )]).toDF(["x"]).select(dummy_udf("x"))
In Spark 2.0 and later use pyspark.ml.linalg to achieve compatibility with pyspark.ml API.
