kd tree implementation in PySpark - apache-spark

I am trying to build a kd-tree using PySpark. For this, I am using
a UDF to recursively build the kd-tree from a two-dimensional list of floats.
The following is the code I am trying:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import *
spark = SparkSession.builder.appName("SRDD").getOrCreate()
sc = spark.sparkContext
# Some sequence of floats
abc = [[0.0769,0.2982],[0.0863,0.30052],[0.0690,0.33337],[0.11975,0.2984],[0.07224,0.3467],[0.1316,0.2999]]
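def build_kdtree(points, depth=0):
    n = len(points)
    if n <= 0:
        return None
    axis = depth % 2  # the points are two-dimensional
    sorted_points = sorted(points, key=lambda point: point[axis])
    return {
        'left': build_kdtree(sorted_points[:n // 2], depth + 1),
        'point': sorted_points[n // 2],
        'right': build_kdtree(sorted_points[n // 2 + 1:], depth + 1)
    }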
#This is how I'm trying to specify the return type of the function
#UDF registration
buildkdtree_udf=udf(build_kdtree, kdtree_schema)
#Function call
However, this throws TypeError: Invalid argument, not a string or column.
I have two main questions:
Is my approach of building the kd-tree recursively in Spark correct?
Are the lines where I specify the return type of the UDF as kdtree_schema correct?
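On the return type, one point worth keeping in mind is that a Spark StructType must have a fixed, finite nesting depth, so a tree of arbitrary depth has no direct struct schema. Below is a minimal sketch of one possible workaround, assuming the tree is built in plain Python and serialized to a JSON string (which a UDF could then return as StringType()). The helper name kdtree_json and the parameter k are illustrative assumptions, not part of the original code.

```python
import json

def build_kdtree(points, depth=0, k=2):
    """Recursively build a kd-tree as nested dicts (None marks an empty subtree)."""
    n = len(points)
    if n == 0:
        return None
    axis = depth % k  # cycle through the k coordinates
    sorted_points = sorted(points, key=lambda point: point[axis])
    mid = n // 2  # integer division so the index is an int on Python 3
    return {
        "left": build_kdtree(sorted_points[:mid], depth + 1, k),
        "point": sorted_points[mid],
        "right": build_kdtree(sorted_points[mid + 1:], depth + 1, k),
    }

def kdtree_json(points):
    """Hypothetical helper: serialize the tree so it fits in a single string value."""
    return json.dumps(build_kdtree(points))

abc = [[0.0769, 0.2982], [0.0863, 0.30052], [0.0690, 0.33337],
       [0.11975, 0.2984], [0.07224, 0.3467], [0.1316, 0.2999]]
tree = build_kdtree(abc)
print(tree["point"])  # the median point along the first axis
```

The serialized tree can be turned back into nested dicts with json.loads wherever it is consumed.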


How can I combine a UDAF with built-in functions in a groupby-aggregate expression?

I am trying to develop a custom describe. To do that, I want to combine functions from pyspark.sql.functions with custom user-defined aggregate functions (UDAFs).
The code looks like:
from pyspark.sql.functions import count
from pyspark.sql.functions import pandas_udf, PandasUDFType
from scipy.stats import entropy
# Define a UDAF
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def my_entropy(data):
    p_data = data.value_counts()  # counts occurrences of each value
    s = entropy(p_data)  # get entropy from counts
    return s
# Perform a groupby-agg
groupby_col = "a_column"
agg_col = "another_column"
df2return = df\
The error thrown is very long, so I have copied only the last exception raised.
Does anyone know how to fix this?
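For reference, the quantity that my_entropy computes can be written out in plain Python. This is a minimal sketch, assuming the default behaviour of scipy.stats.entropy (counts are normalized to probabilities and the natural logarithm is used). The function name entropy_from_values is hypothetical, and the snippet is only a way to sanity-check the UDAF's result for a single group, not a replacement for it.

```python
import math
from collections import Counter

def entropy_from_values(values):
    """Shannon entropy (in nats) of the empirical distribution of `values`."""
    counts = Counter(values)  # plays the role of value_counts()
    total = sum(counts.values())
    # Normalize counts to probabilities, then apply -sum(p * log(p)).
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Two equally likely values give an entropy of ln(2).
print(entropy_from_values(["x", "x", "y", "y"]))
```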

Adding a Vectors Column to a pyspark DataFrame

How do I add a Vectors.dense column to a pyspark dataframe?
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.linalg import DenseVector
py_df = pd.DataFrame.from_dict({"time": [59., 115., 156., 421.], "event": [1, 1, 1, 0]})
sc = SparkContext(master="local")
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(py_df)
sdf.withColumn("features", DenseVector(1))
Gives an error in file anaconda3/lib/python3.6/site-packages/pyspark/sql/dataframe.py, line 1848:
AssertionError: col should be Column
It doesn't like the DenseVector type as a column. Essentially, I have a pandas dataframe that I'd like to transform to a pyspark dataframe and add a column of the type Vectors.dense. Is there another way of doing this?
Constant vectors cannot be added as a literal. You have to use a udf:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import VectorUDT
one = udf(lambda: DenseVector([1]), VectorUDT())
sdf.withColumn("features", one()).show()
But I am not sure why you need that at all. If you want to transform existing columns into Vectors, use the appropriate pyspark.ml tools, such as VectorAssembler (see "Encode and assemble multiple features in PySpark"):
from pyspark.ml.feature import VectorAssembler
VectorAssembler(inputCols=["time"], outputCol="features").transform(sdf)

Pyspark: applying kmeans on different groups of a dataframe

Using PySpark, I would like to apply k-means separately to each group of a DataFrame rather than to the whole DataFrame at once. At the moment I use a for loop that iterates over the groups, applies k-means, and appends the result to another table. With a lot of groups, this is time consuming. Could anyone help me, please?
Thanks a lot!
for customer in customer_list:
    temp_df = togroup.filter(col("customer_id") == customer)
    df = assembler.transform(temp_df)
    k = 1
    while k < 5 and mtric < width:
        k += 1
        kmeans = KMeans(k=k, seed=5, maxIter=20, initSteps=5)
        model = kmeans.fit(df)
        mtric = 1 - model.computeCost(df) / ttvar
    a = model.transform(df).select(cols)
    allcustomers = allcustomers.union(a)
I came up with a solution using pandas_udf. A pure spark or scala solution is preferred and yet to be offered.
Assume my data is
import pandas as pd
df_pd = pd.DataFrame([['cat1',10.],['cat1',20.],['cat1',11.],['cat1',21.],['cat1',22.],['cat1',9.],['cat2',101.],['cat2',201.],['cat2',111.],['cat2',214.],['cat2',224.],['cat2',99.]],columns=['cat','val'])
df_spark = spark.createDataFrame(df_pd)
First solve the problem in pandas:
import numpy as np
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2,random_state=0)
def skmean(kmeans,x):
    X = np.array(x)
You can apply skmean() to a pandas DataFrame (to make sure it works properly):
df_pd.groupby('cat').apply(lambda x:skmean(kmeans,x)).reset_index()
To apply the function to a PySpark DataFrame, we use pandas_udf. But first, define a schema for the output DataFrame:
from pyspark.sql.types import *
schema = StructType(
Convert the function above to a pandas_udf:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def skmean_udf(df):
    result = pd.DataFrame(
        df.groupby('cat').apply(lambda x: skmean(kmeans, x))
    )
    result.reset_index(inplace=True, drop=False)
    return result
You can use the function as follows:
I came up with a second solution, which I think is slightly better than the last one. The idea is to use groupby() together with collect_list() and write a udf that takes a list as input and generates the clusters. Continuing with df_spark from the other solution, we write:
df_flat = df_spark.groupby('cat').agg(F.collect_list('val').alias('val_list'))
Now we write the udf function:
import numpy as np
import pyspark.sql.functions as F
from sklearn.cluster import KMeans
from pyspark.sql.types import *
def skmean(x):
    kmeans = KMeans(n_clusters=2, random_state=0)
    X = np.array(x).reshape(-1,1)
    kmeans.fit(X)  # the model must be fitted before predict() can be called
    clusters = kmeans.predict(X).tolist()
    return clusters
clustering_udf = F.udf(lambda arr : skmean(arr), ArrayType(IntegerType()))
Then apply the udf to the flattened dataframe:
df = df_flat.withColumn('clusters', clustering_udf(F.col('val_list')))
Then you can use F.explode() to convert the list to a column.
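As a plain-Python picture of that last step (this is not Spark code, and the cluster labels below are made up purely for illustration), exploding turns one row per group holding parallel lists into one row per element. The function name explode_rows is hypothetical.

```python
# One row per group, shaped like the output of collect_list() plus the
# clustering udf. The cluster labels are invented for illustration only.
grouped_rows = [
    {"cat": "cat1", "val_list": [10.0, 20.0, 11.0], "clusters": [0, 1, 0]},
    {"cat": "cat2", "val_list": [101.0, 201.0], "clusters": [1, 0]},
]

def explode_rows(rows):
    """Emit one flat row per (value, cluster) pair, keeping the group key."""
    return [
        {"cat": row["cat"], "val": val, "cluster": cluster}
        for row in rows
        for val, cluster in zip(row["val_list"], row["clusters"])
    ]

flat_rows = explode_rows(grouped_rows)
for flat_row in flat_rows:
    print(flat_row)
```

In Spark itself, keeping values and labels aligned means zipping the two arrays before exploding them (for example with F.arrays_zip, available from Spark 2.4), since exploding each array in a separate step would pair every value with every label within a group.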

Pyspark freezing after an operation on aggregated Data Frame

I am using Spark 1.5.2 with Python 2.7.5.
I have this code that I run in the pyspark repl:
from pyspark.sql import SQLContext
ctx = SQLContext(sc)
df = ctx.createDataFrame([("a",1),("a",1),("a",0),("a",0),("b",1),("b",0),("b",1)],["group","conversion"])
from pyspark.sql.functions import col, count, avg
funs = [(count,"total"),(avg,"cr")]
aggregate = ["conversion"]
exprs = [f(col(c)).alias(name) for f,name in funs for c in aggregate]
df3 = df.groupBy("group").agg(*exprs).cache()
So far the code works fine and I can check df3:
>>> df3.collect()
[Row(group=u'a', total=4, cr=0.5), Row(group=u'b', total=3, cr=0.6666666666666666)]
However, when I try:
PySpark can't calculate that sum. However df3.rdd.reduce(lambda x,y: x[2]+y[2]) works just fine.
So, what is the issue with the first command to calculate the sum?
You should import PySpark's sum function first: from pyspark.sql.functions import sum. Otherwise Python's built-in sum is called, which just sums a sequence of numbers.
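To see why the built-in is the wrong tool here, note that Python's sum simply iterates over its argument and adds up the items. The sketch below uses a toy stand-in class (FakeColumn is hypothetical, not Spark's Column) that accepts indexing with any key. This is assumed here to resemble how a Column behaved in older Spark versions, and it would be consistent with the freeze described in the question: Python's fallback iteration protocol never reaches an end, so sum(...) over such an object would not return.

```python
import itertools

class FakeColumn:
    """Toy stand-in for a column expression; NOT the real pyspark Column."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, key):
        # Indexing always succeeds and returns another expression, so
        # Python's legacy iteration protocol never sees an IndexError.
        return FakeColumn("{}[{}]".format(self.name, key))

cr = FakeColumn("cr")

# Built-in sum works as expected on an ordinary sequence of numbers.
print(sum([0.5, 0.25]))

# Iterating the stand-in never terminates on its own, so take only the
# first three items to show what sum(cr) would keep pulling indefinitely.
first_items = [item.name for item in itertools.islice(iter(cr), 3)]
print(first_items)
```

Importing the functions module under an alias (import pyspark.sql.functions as F, as used elsewhere on this page) and calling F.sum avoids shadowing the built-in at all.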

pyspark expected zero arguments for construction of ClassDict (for pyspark.mllib.linalg.DenseVector)

I get the error
expected zero arguments for construction of ClassDict (for pyspark.mllib.linalg.DenseVector)
by trying this:
I have a function which I convert to a udf for transforming values of a column from a dataframe. Like this:
def func(vector):
    #does something
    return Vector.dense(vector)
udfunc = udf(func, ArrayType(FloatType()))
new_df = df.withColumn("vector",func(df.vector))
The column df.vector has denseVector values.
Does anybody have an idea how to fix this problem, or a hint?
Thanks in advance.
Given the part of the code you provided, the obvious issue is that you declare an incorrect return type. The Catalyst type of a Vector is VectorUDT, not ArrayType(FloatType()):
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.sql.types import ArrayType, FloatType
from pyspark.sql.functions import udf
dummy_udf = udf(lambda _: Vectors.dense([0, 0, 0]), VectorUDT())
sc.parallelize([(Vectors.dense([1, 1, 1]), )]).toDF(["x"]).select(dummy_udf("x"))
In Spark 2.0 and later use pyspark.ml.linalg to achieve compatibility with pyspark.ml API.
