How can I combine UDAF with functions in a groupby-aggregate expression? - apache-spark

I am trying to develop a custom describe. To do that, I want to combine functions from pyspark.sql.functions with custom user-defined aggregate functions (UDAFs).
The code looks like:
from pyspark.sql.functions import count
from pyspark.sql.functions import pandas_udf, PandasUDFType
from scipy.stats import entropy
# Define a UDAF
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def my_entropy(data):
    p_data = data.value_counts()  # count occurrences of each value
    s = entropy(p_data)  # get entropy from counts
    return s
# Perform a groupby-agg
groupby_col = "a_column"
agg_col = "another_column"
df2return = df\
    .groupBy(groupby_col)\
    .agg(count(agg_col).alias("count"),
         my_entropy(agg_col).alias("s"))
df2return.show()
The error thrown is very long, so I have copied only the last exception raised.
Does anyone know how to fix this?
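For reference, the per-group quantity that my_entropy is meant to return can be sketched without Spark. This is only an illustration, assuming scipy.stats.entropy is called with its defaults (counts normalised to probabilities, natural logarithm); the helper name entropy_from_values is hypothetical and does not appear in the question.

```python
import math
from collections import Counter

def entropy_from_values(values):
    """Shannon entropy (natural log) of the empirical distribution of values."""
    counts = Counter(values)          # occurrences of each distinct value
    total = sum(counts.values())
    # Normalise counts to probabilities, then sum -p * ln(p).
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Two equally likely values give an entropy of ln(2).
print(entropy_from_values(["x", "x", "y", "y"]))
```

Comparing this value with the "s" column for a small group is one way to check that the aggregate is wired up as intended.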

Related

Pandas pandarallel parallel_apply

Here is a simple program that works in parallel. But it has an issue when I want to use a previous result in the apply.
import pandas as pd
import numpy as np
from pandarallel import pandarallel
pandarallel.initialize(nb_workers=8) # nb_workers=NUMBER_OF_CPU_CORES
def dummy_fit(x, y_hint=0.5):
    # Imagine quite a complicated code here
    # y_hint is a previous fit. When it is not given, use default
    y = (x.mean() + y_hint) / 2
    return y
df = pd.DataFrame(np.random.random((10, 3)), columns=list("ABC"))
print("data:\n", df)
result = df.parallel_apply(dummy_fit, axis=1)
print(result)
We could use a global variable, but there would be only one of it, while there are several workers.
How can this be made to work in parallel?
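One possible direction, offered as an assumption and not as an established answer: if each fit needs the previous fit, the chain is sequential by nature, so one compromise is to split the rows into contiguous chunks, run each chunk sequentially with its own hint chain, and run the chunks concurrently. The sketch below uses plain lists and the standard library instead of pandas and pandarallel; fit_chunk and chunked are hypothetical helpers. Each chunk restarts from the default hint, so the results differ from a single sequential pass.

```python
from concurrent.futures import ThreadPoolExecutor

def dummy_fit(row, y_hint=0.5):
    # Same formula as in the question, written for a plain list of numbers.
    return (sum(row) / len(row) + y_hint) / 2

def fit_chunk(rows, y_hint=0.5):
    """Fit rows one after another, feeding each result into the next fit."""
    results = []
    for row in rows:
        y_hint = dummy_fit(row, y_hint)
        results.append(y_hint)
    return results

def chunked(rows, n_chunks):
    """Split rows into at most n_chunks contiguous pieces."""
    size = -(-len(rows) // n_chunks)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]

rows = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
# Threads keep the demo simple; a process pool would be needed for CPU-bound work.
with ThreadPoolExecutor(max_workers=2) as pool:
    parts = list(pool.map(fit_chunk, chunked(rows, 2)))
result = [y for part in parts for y in part]
print(result)
```

Each worker keeps its own local hint, so no shared global variable is involved.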

kd tree implementation in PySpark

I am trying to build a kd-tree using PySpark. For this, I am using a
UDF to recursively build the kd-tree from a 2-dimensional list of floats.
Following is the piece of code I am trying:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import *
spark = SparkSession.builder.appName("SRDD").getOrCreate()
sc = spark.sparkContext
# Some sequence of floats
abc = [[0.0769,0.2982],[0.0863,0.30052],[0.0690,0.33337],[0.11975,0.2984],[0.07224,0.3467],[0.1316,0.2999]]
def build_kdtree(points,depth=0):
    n=points.count()
    if n<=0:
        return None
    axis=depth%2
    sorted_points=sorted(points,key=lambda point:point[axis])
    return{
        'point': sorted_points[n/2],
        'left':build_kdtree(sorted_points[:n/2],depth+1),
        'right':build_kdtree(sorted_points[n/2 + 1:],depth+1)
    }
#This is how I'm trying to specify the return type of the function
kdtree_schema=StructType([StructField('point',ArrayType(FloatType()),nullable=True),StructField('left',StructType(),nullable=True),StructField('right',StructType(),nullable=True)])
kdtree_schema=StructType([StructField('point',ArrayType(FloatType()),nullable=True),StructField('left',kdtree_schema,nullable=True),StructField('right',kdtree_schema,nullable=True)])
#UDF registration
buildkdtree_udf=udf(build_kdtree, kdtree_schema)
#Function call
pointskdtree=buildkdtree_udf(abc)
However, this throws TypeError: Invalid argument, not a string or column.
I have 2 main questions:
Is my approach of building the kd-tree recursively in Spark correct?
Are the lines where I specify the return type of the UDF as kdtree_schema correct?
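A UDF call expects column arguments, and a plain Python list is not one, which is consistent with the TypeError quoted above. Independently of Spark, the recursion itself can be checked in plain Python. The sketch below is an illustration under two assumptions about the intent: len(points) replaces points.count(), since list.count() requires an argument, and integer division n // 2 replaces n / 2, since a float cannot be used as a list index in Python 3.

```python
def build_kdtree(points, depth=0):
    """Recursively build a 2-d kd-tree as nested dicts (None marks an empty subtree)."""
    n = len(points)
    if n == 0:
        return None
    axis = depth % 2  # alternate between the first and second coordinate
    sorted_points = sorted(points, key=lambda p: p[axis])
    mid = n // 2  # integer index of the median point
    return {
        "point": sorted_points[mid],
        "left": build_kdtree(sorted_points[:mid], depth + 1),
        "right": build_kdtree(sorted_points[mid + 1:], depth + 1),
    }

abc = [[0.0769, 0.2982], [0.0863, 0.30052], [0.0690, 0.33337],
       [0.11975, 0.2984], [0.07224, 0.3467], [0.1316, 0.2999]]
tree = build_kdtree(abc)
print(tree["point"])
```

Running this on the driver with the sample list is a quick way to validate the tree logic before deciding how, or whether, to move it into a UDF.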

Pyspark: applying kmeans on different groups of a dataframe

Using PySpark, I would like to apply k-means separately to each group of a DataFrame rather than to the whole DataFrame at once. For the moment I use a for loop that iterates over each group, applies k-means, and appends the result to another table. But with a lot of groups this is time-consuming. Could anyone help me, please?
Thanks a lot!
for customer in customer_list:
    temp_df = togroup.filter(col("customer_id") == customer)
    df = assembler.transform(temp_df)
    k = 1
    while k < 5 and mtric < width:
        k += 1
        kmeans = KMeans(k=k, seed=5, maxIter=20, initSteps=5)
        model = kmeans.fit(df)
        mtric = 1 - model.computeCost(df) / ttvar
    a = model.transform(df).select(cols)
    allcustomers = allcustomers.union(a)
I came up with a solution using pandas_udf. A pure Spark or Scala solution would be preferable, but none has been offered yet.
Assume my data is
import pandas as pd
df_pd = pd.DataFrame([['cat1',10.],['cat1',20.],['cat1',11.],['cat1',21.],['cat1',22.],['cat1',9.],['cat2',101.],['cat2',201.],['cat2',111.],['cat2',214.],['cat2',224.],['cat2',99.]],columns=['cat','val'])
df_spark = spark.createDataFrame(df_pd)
First solve the problem in pandas:
import numpy as np
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2,random_state=0)
def skmean(kmeans,x):
    X = np.array(x)
    kmeans.fit(X)
    return(kmeans.predict(X))
You can apply skmean() to a pandas DataFrame (to make sure it works properly):
df_pd.groupby('cat').apply(lambda x:skmean(kmeans,x)).reset_index()
To apply the function to a PySpark DataFrame, we use pandas_udf. But first define a schema for the output DataFrame:
from pyspark.sql.types import *
schema = StructType(
[StructField('cat',StringType(),True),
StructField('clusters',ArrayType(IntegerType()))])
Convert the function above to a pandas_udf:
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def skmean_udf(df):
    result = pd.DataFrame(
        df.groupby('cat').apply(lambda x: skmean(kmeans,x)))
    result.reset_index(inplace=True, drop=False)
    return(result)
You can use the function as follows:
df_spark.groupby('cat').apply(skmean_udf).show()
I came up with a second solution, which I think is slightly better than the last one. The idea is to use groupby() together with collect_list() and write a UDF that takes a list as input and generates the clusters. Continuing with df_spark from the other solution, we write:
df_flat = df_spark.groupby('cat').agg(F.collect_list('val').alias('val_list'))
Now we write the udf function:
import numpy as np
import pyspark.sql.functions as F
from sklearn.cluster import KMeans
from pyspark.sql.types import *
def skmean(x):
    kmeans = KMeans(n_clusters=2, random_state=0)
    X = np.array(x).reshape(-1,1)
    kmeans.fit(X)
    clusters = kmeans.predict(X).tolist()
    return(clusters)
clustering_udf = F.udf(lambda arr : skmean(arr), ArrayType(IntegerType()))
Then apply the udf to the flattened dataframe:
df = df_flat.withColumn('clusters', clustering_udf(F.col('val_list')))
Then you can use F.explode() to convert the list to a column.

Dask: How would I parallelize my code with dask delayed?

This is my first venture into parallel processing and I have been looking into Dask but I am having trouble actually coding it.
I have had a look at their examples and documentation, and I think dask.delayed will work best. I attempted to wrap my functions with delayed(function_name), or to add a @delayed decorator, but I can't seem to get it working properly. I preferred Dask over other methods since it is written in Python and for its (supposed) simplicity. I know Dask doesn't work on the for loop itself, but they say it can work inside a loop.
My code passes files through a function that contains inputs to other functions and looks like this:
from dask import delayed
filenames = ['1.csv', '2.csv', '3.csv', ...]
for count, name in enumerate(filenames):
    name = name.split('.')[0]
    ...
Then I do some pre-processing, for example:
    pre_result1, pre_result2 = delayed(read_files_and_do_some_stuff)(name)
Then I call a constructor and pass the pre-processing results into the function calls:
    fc = FunctionCalls()
    Daily = delayed(fc.function_runs)(filename=name, stringinput='Daily',
                                      input_data=pre_result1, model1=pre_result2)
What I do here is pass each file into the for loop, do some pre-processing, and then pass the file into two models.
Any thoughts or tips on how to parallelize this? I began getting odd errors and had no idea how to fix the code. The code does work as is. I use a bunch of pandas DataFrames, Series, and NumPy arrays, and I would prefer not to go back and change everything to work with dask.dataframe etc.
The code in my comment may be difficult to read. Here it is in a more formatted way.
In the code below, when I type print(mean_squared_error) I just get: Delayed('mean_squared_error-3009ec00-7ff5-4865-8338-1fec3f9ed138')
from dask import delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = ['file1.csv']
for count, name in enumerate(filenames):
    file1 = pd.read_csv(name)
    df = pd.DataFrame(file1)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = delayed(mse)(observed, prediction)
You need to call dask.compute to eventually compute the result. See dask.delayed documentation.
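To see why printing the variable shows a placeholder and not a number, here is a toy model of deferred evaluation using only the standard library. The Lazy class and lag_mse helper are hypothetical stand-ins written for this illustration; they are not Dask's implementation or API, only the same idea in miniature: building the object records what to run, and nothing runs until compute() is called.

```python
class Lazy:
    """Toy deferred call: stores a function and its arguments without running it."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve any lazy arguments first, then call the stored function.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)

    def __repr__(self):
        return f"Lazy({self.func.__name__})"

def lag_mse(values):
    """Mean squared error between each value and the one before it."""
    observed = values[1:]
    prediction = values[:-1]
    return sum((o - p) ** 2 for o, p in zip(observed, prediction)) / len(observed)

task = Lazy(lag_mse, [1.0, 2.0, 4.0])
print(task)            # only a placeholder; nothing has been computed yet
print(task.compute())  # the value appears once computation is requested
```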
Sequential code
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
results = []
for count, name in enumerate(filenames):
    file1 = pd.read_csv(name)
    df = pd.DataFrame(file1) # isn't this already a dataframe?
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = mse(observed, prediction)
    results.append(mean_squared_error)
Parallel code
import dask
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
delayed_results = []
for count, name in enumerate(filenames):
    df = dask.delayed(pd.read_csv)(name)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    mean_squared_error = dask.delayed(mse)(observed, prediction)
    delayed_results.append(mean_squared_error)
results = dask.compute(*delayed_results)
A much clearer solution, IMO, than the accepted answer is this snippet.
from dask import compute, delayed
import pandas as pd
from sklearn.metrics import mean_squared_error as mse
filenames = [...]
def compute_mse(file_name):
    df = pd.read_csv(file_name)
    prediction = df['Close'][:-1]
    observed = df['Close'][1:]
    return mse(observed, prediction)
delayed_results = [delayed(compute_mse)(file_name) for file_name in filenames]
mean_squared_errors = compute(*delayed_results, scheduler="processes")

Pyspark freezing after an operation on aggregated Data Frame

I am using Spark 1.5.2 with Python 2.7.5.
I have this code that I run in the pyspark repl:
from pyspark.sql import SQLContext
ctx = SQLContext(sc)
df = ctx.createDataFrame([("a",1),("a",1),("a",0),("a",0),("b",1),("b",0),("b",1)],["group","conversion"])
from pyspark.sql.functions import col, count, avg
funs = [(count,"total"),(avg,"cr")]
aggregate = ["conversion"]
exprs = [f(col(c)).alias(name) for f,name in funs for c in aggregate]
df3 = df.groupBy("group").agg(*exprs).cache()
So far the code works fine and I can check df3:
>>> df3.collect()
[Row(group=u'a', total=4, cr=0.5), Row(group=u'b', total=3, cr=0.6666666666666666)]
However, when I try:
df3.agg(sum(col('cr'))).first()[0]
PySpark can't calculate that sum. However df3.rdd.reduce(lambda x,y: x[2]+y[2]) works just fine.
So, what is the issue with the first command to calculate the sum?
You should import PySpark's sum function first: from pyspark.sql.functions import sum. Otherwise Python's built-in sum is called, which simply iterates over its argument and adds up the items as numbers.
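A possible reason for a hang instead of an error, offered as an assumption and not verified against Spark 1.5.2: Python's built-in sum iterates over its argument, and an object that defines __getitem__ but not __iter__ is iterated by calling obj[0], obj[1], and so on until IndexError is raised. If indexing never fails, the loop never ends. The FakeColumn class below is a hypothetical stand-in written for this illustration, not a pyspark class.

```python
from itertools import islice

class FakeColumn:
    """Hypothetical stand-in for a column expression; not a pyspark class."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, key):
        # Indexing always succeeds and never raises IndexError.
        return FakeColumn(f"{self.name}[{key}]")

col = FakeColumn("cr")
# Without __iter__, Python falls back to __getitem__(0), __getitem__(1), ...
# so iterating over this object would never stop; islice caps it for the demo.
first_items = [c.name for c in islice(iter(col), 3)]
print(first_items)
```

Importing the functions module under an alias (for example, import pyspark.sql.functions as F, as used earlier on this page) avoids shadowing the built-in in the first place.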
