User defined function to be applied to Window in PySpark? - apache-spark

I am trying to apply a user defined function to Window in PySpark. I have read that UDAF might be the way to go, but I was not able to find anything concrete.
To give an example (taken from here: Xinh's Tech Blog and modified for PySpark):
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import avg
spark = SparkSession.builder.master("local").config(conf=SparkConf()).getOrCreate()
a = spark.createDataFrame([[1, "a"], [2, "b"], [3, "c"], [4, "d"], [5, "e"]], ['ind', "state"])
customers = spark.createDataFrame([["Alice", "2016-05-01", 50.00],
["Alice", "2016-05-03", 45.00],
["Alice", "2016-05-04", 55.00],
["Bob", "2016-05-01", 25.00],
["Bob", "2016-05-04", 29.00],
["Bob", "2016-05-06", 27.00]],
["name", "date", "amountSpent"])
customers.show()
window_spec = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
result = customers.withColumn( "movingAvg", avg(customers["amountSpent"]).over(window_spec))
result.show()
I am applying avg, which is already built into pyspark.sql.functions, but if instead of avg I wanted to use something more complicated and write my own function, how would I do that?

Spark >= 3.0:
SPARK-24561 - User-defined window functions with pandas udf (bounded window) is a work in progress. Please follow the related JIRA for details.
Spark >= 2.4:
SPARK-22239 - User-defined window functions with pandas udf (unbounded window) introduced support for Pandas based window functions with unbounded windows. General structure is
return_type: DataType
@pandas_udf(return_type, PandasUDFType.GROUPED_AGG)
def f(v):
    return ...
w = (Window
    .partitionBy(grouping_column)
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
df.withColumn('foo', f('bar').over(w))
Please see the doctests and the unit tests for detailed examples.
Spark < 2.4
You cannot. Window functions require a UserDefinedAggregateFunction or an equivalent object, not a UserDefinedFunction, and it is not possible to define one in PySpark.
However, in PySpark 2.3 or later you can define a vectorized pandas_udf, which can be applied to grouped data. You can find a working example in "Applying UDFs on GroupedData in PySpark (with functioning python example)". While Pandas doesn't provide a direct equivalent of window functions, it is expressive enough to implement any window-like logic, especially with pandas.DataFrame.rolling. Furthermore, a function used with GroupedData.apply can return an arbitrary number of rows.
You can also call a Scala UDAF from PySpark; see "Spark: How to map Python with Scala or Java User Defined Functions?".

As of Spark 3.0.0, UDFs (pandas UDFs) can be applied over a Window.
https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.pandas_udf.html
Extract from the documentation:
import pandas as pd
from pyspark.sql import Window
from pyspark.sql.functions import pandas_udf
@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return v.mean()
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))
w = Window.partitionBy('id').orderBy('v').rowsBetween(-1, 0)
df.withColumn('mean_v', mean_udf("v").over(w)).show()
+---+----+------+
| id|   v|mean_v|
+---+----+------+
|  1| 1.0|   1.0|
|  1| 2.0|   1.5|
|  2| 3.0|   3.0|
|  2| 5.0|   4.0|
|  2|10.0|   7.5|
+---+----+------+
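As a cross-check of the table above, the frame semantics can be emulated in plain Python without Spark. This is only a sketch: apply_over_rows is a hypothetical helper (not a PySpark API) that assumes rows are dicts, partitions them by one key, orders them by one column, and applies an arbitrary Python function to the frame rowsBetween(start, end).

```python
from itertools import groupby
from operator import itemgetter
from statistics import mean


def apply_over_rows(rows, key, order, value, func, start, end):
    """Hypothetical helper: emulate func(value) over
    partitionBy(key).orderBy(order).rowsBetween(start, end)."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[key], r[order]))
    for _, part in groupby(ordered, key=itemgetter(key)):
        part = list(part)
        for i, row in enumerate(part):
            # Clip the frame to the partition boundaries.
            lo = max(0, i + start)
            hi = min(len(part), i + end + 1)
            frame = [r[value] for r in part[lo:hi]]
            out.append({**row, "result": func(frame)})
    return out


rows = [{"id": i, "v": v}
        for i, v in [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)]]

# Same frame as rowsBetween(-1, 0): previous row and current row.
mean_v = [r["result"] for r in apply_over_rows(rows, "id", "v", "v", mean, -1, 0)]
print(mean_v)  # [1.0, 1.5, 3.0, 4.0, 7.5]

# Any Python function of a list works, e.g. the spread within the frame.
spread = [r["result"] for r in
          apply_over_rows(rows, "id", "v", "v", lambda xs: max(xs) - min(xs), -1, 0)]
print(spread)  # [0.0, 1.0, 0.0, 2.0, 5.0]
```

The mean_v values match the mean_v column shown above; on a real cluster the pandas_udf version is the one to use.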

Related

How to randomly shuffle the values of only one column in pyspark?

I want to break the correlation between a column and the rest of the dataframe. I want to do this while maintaining the distribution of the values in that column.
In pandas, I used to achieve this by simply shuffling the values of a column and then assigning the values to the column. It is not so straightforward in the case of pyspark because of the data being partitioned. I do not think there is even a way in pyspark to set a new column in a dataframe with a column from another dataframe.
So, in pyspark, how do I achieve the following?
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4],'b':[3,4,5,6]})
df['b'] = df['b'].sample(frac=1).reset_index(drop=True)
Also, I'm hoping you give me a solution that does not include unnecessary shuffles.
One way to do it is:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import row_number,lit,rand
from pyspark.sql.window import Window
df = pd.DataFrame({'a':[1,2,3,4],'b':[3,4,5,6]})
dfs = spark.createDataFrame(df)
w = Window().orderBy(lit('A'))
dfs = dfs.withColumn("row_num", row_number().over(w))
dfs_ts = dfs.select('b')
dfs_ts = dfs_ts.withColumn('o',rand()).orderBy('o')
dfs = dfs.drop('b')
dfs_ts = dfs_ts.drop('o')
w = Window().orderBy(lit('A'))
dfs_ts = dfs_ts.withColumn("row_num", row_number().over(w))
dfs = dfs.join(dfs_ts,on='row_num').drop('row_num')
But I do not need the shuffles that come with the join; they are unnecessary. If a blind hstack on a per-partition basis is possible in pyspark, that should be enough. Also, the window function warns me that I have not defined any partitions, so all my data would be collected into one partition. I might as well use pandas in that case.
I think you could achieve something like that on the RDD level. Not sure if it would be worth all the extra steps, and depending on the size of the partitions it might not perform really well because you'd have to hold all the values of the partition in memory for the shuffle.
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame({'a':[x for x in range(10)],'b':[x for x in range(10)]})
dfs = spark.createDataFrame(df)
dfs.show()
rdd_a = dfs.select("a").rdd
rdd_b = dfs.select("b").rdd
from random import shuffle
def f(iterator):
    # Shuffle the rows of one partition; the whole partition is held in memory.
    items = list(iterator)
    shuffle(items)
    for x in items:
        yield x
from pyspark import Row
def reconstruct(x):
    # x is a pair (row from rdd_a, row from rdd_b) produced by zip.
    return Row(a=x[0].a, b=x[1].b)
# mapPartitions keeps the number and size of partitions unchanged, which zip requires.
rdd_b_reordered = rdd_b.mapPartitions(f)
df_reordered = spark.createDataFrame(rdd_a.zip(rdd_b_reordered).map(reconstruct))
df_reordered.show()
"""
My output:
+---+---+
|  a|  b|
+---+---+
|  0|  2|
|  1|  4|
|  2|  0|
|  3|  1|
|  4|  3|
|  5|  7|
|  6|  9|
|  7|  5|
|  8|  6|
|  9|  8|
+---+---+
"""
Maybe you can tweak that to your needs. This only shuffles the values within each partition, which avoids a shuffle across partitions. It is probably also better to do this in Scala.
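To see what "shuffling within each partition" does and does not randomize, here is a plain-Python sketch with no Spark involved. The partitions are modelled as lists of values, and shuffle_within_partitions is a hypothetical stand-in for the mapPartitions step; the two-partition layout is an assumption chosen for illustration.

```python
import random
from collections import Counter


def shuffle_within_partitions(partitions, seed=None):
    """Hypothetical stand-in for rdd.mapPartitions(f): shuffle each
    partition independently, never moving values between partitions."""
    rng = random.Random(seed)
    shuffled = []
    for part in partitions:
        items = list(part)  # copy, so the input is left untouched
        rng.shuffle(items)
        shuffled.append(items)
    return shuffled


# Two partitions of five values each (assumed layout).
a_parts = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
b_parts = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

b_shuffled = shuffle_within_partitions(b_parts, seed=42)

# Equivalent of rdd_a.zip(rdd_b_reordered): pair values position by position.
rows = [(a, b) for pa, pb in zip(a_parts, b_shuffled) for a, b in zip(pa, pb)]

# The overall distribution of b is preserved.
print(Counter(b for _, b in rows) == Counter(range(10)))

# Partitions holding a single row cannot change at all.
print(shuffle_within_partitions([[1], [2], [3], [4]]))
```

The values of b stay inside their original partition, so the result is only as random as the partitioning allows.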
I am not fully confident in this answer, but I think it's right.
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python. Fugue can then port it to Spark for you with one function call.
So if we get your Pandas example:
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4],'b':[3,4,5,6]})
df['b'] = df['b'].sample(frac=1).reset_index(drop=True)
We can just wrap the .sample expression in a function.
def shuffle(df: pd.DataFrame) -> pd.DataFrame:
    df['b'] = df['b'].sample(frac=1).reset_index(drop=True)
    return df
And then we can bring it to Spark using the transform function. The transform function can take in both Pandas and Spark DataFrames, and it will convert the input to Spark if you are using the Spark engine.
from fugue import transform
import fugue_spark
transform(df, shuffle, schema="*", engine="spark")
But of course, this transform is applied per partition. If you don't specify the partitions, it uses the default partitions. If you want to shuffle to really randomize it, you can do:
transform(df, shuffle, schema="*", engine="spark", partition={"algo":"rand"}).show()
And Fugue will partition your data randomly before this operation.
You may not see the shuffle for your test case because the data is really small. If you end up with 4 partitions with 1 row each, they will end up returning the same value.

How to update pyspark dataframe metadata on Spark 2.1?

I'm facing an issue with the OneHotEncoder of SparkML since it reads dataframe metadata in order to determine the value range it should assign for the sparse vector object it's creating.
More specifically, I'm encoding a "hour" field using a training set containing all individual values between 0 and 23.
Now I'm scoring a single row data frame using the "transform" method of the Pipeline.
Unfortunately, this leads to a differently encoded sparse vector object for the OneHotEncoder:
(24,[5],[1.0]) vs. (11,[10],[1.0])
I've documented this here, but this was identified as a duplicate. So in this thread there is a solution posted to update the dataframe's metadata to reflect the real range of the "hour" field:
from pyspark.sql.functions import col
meta = {"ml_attr": {
    "vals": [str(x) for x in range(6)],  # Provide a set of levels
    "type": "nominal",
    "name": "class"}}
loaded.transform(
    df.withColumn("class", col("class").alias("class", metadata=meta)))
Unfortunately I get this error:
TypeError: alias() got an unexpected keyword argument 'metadata'
In PySpark 2.1, the alias method has no metadata argument (docs); it became available in Spark 2.2. Nevertheless, it is still possible to modify column metadata in PySpark < 2.2, thanks to the incredible Spark Gotchas, maintained by @eliasah and @zero323:
import json
from pyspark import SparkContext
from pyspark.sql import Column
from pyspark.sql.functions import col
spark.version
# u'2.1.1'
df = sc.parallelize((
(0, "x", 2.0),
(1, "y", 3.0),
(2, "x", -1.0)
)).toDF(["label", "x1", "x2"])
df.show()
# +-----+---+----+
# |label| x1|  x2|
# +-----+---+----+
# |    0|  x| 2.0|
# |    1|  y| 3.0|
# |    2|  x|-1.0|
# +-----+---+----+
Suppose that we want our label column to allow values between 0 and 5, even though the values in our dataframe are between 0 and 2. Here is how we should modify the column metadata:
def withMeta(self, alias, meta):
    sc = SparkContext._active_spark_context
    jmeta = sc._gateway.jvm.org.apache.spark.sql.types.Metadata
    return Column(getattr(self._jc, "as")(alias, jmeta.fromJson(json.dumps(meta))))
Column.withMeta = withMeta
# new metadata:
meta = {"ml_attr": {"name": "label_with_meta",
                    "type": "nominal",
                    "vals": [str(x) for x in range(6)]}}
df_with_meta = df.withColumn("label_with_meta", col("label").withMeta("", meta))
Kudos also to this answer by zero323!
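The withMeta helper above hands a JSON string to Metadata.fromJson, so the metadata payload can be built and inspected in plain Python before touching Spark. The sketch below assumes nothing beyond the dictionary layout used in this answer; nominal_meta is a hypothetical convenience function, not part of PySpark.

```python
import json


def nominal_meta(name, levels):
    """Hypothetical helper: build the ml_attr dictionary for a nominal
    column whose allowed levels are given explicitly."""
    return {"ml_attr": {"name": name,
                        "type": "nominal",
                        "vals": [str(x) for x in levels]}}


meta = nominal_meta("label_with_meta", range(6))

# This string is what withMeta passes on via json.dumps(meta).
payload = json.dumps(meta)
print(payload)

# The number of declared levels is fixed by the metadata, not by the rows
# that happen to be present in the dataframe being scored.
print(len(meta["ml_attr"]["vals"]))  # 6
```

Declaring the levels explicitly like this is the point of the fix: the level list no longer depends on which values appear in a given dataframe.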

convert dataframe to libsvm format

I have a dataframe resulting from a sql query
df1 = sqlContext.sql("select * from table_test")
I need to convert this dataframe to libsvm format so that it can be provided as an input for
pyspark.ml.classification.LogisticRegression
I tried to do the following. However, this resulted in the following error as I'm using spark 1.5.2
df1.write.format("libsvm").save("data/foo")
Failed to load class for data source: libsvm
I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and can't directly pip install it. So I downloaded the file, scp-ed it and then manually installed it. Everything seemed to work fine but I still get the following error
import org.apache.spark.mllib.util.MLUtils
No module named org.apache.spark.mllib.util.MLUtils
Question 1: Is my above approach to convert dataframe to libsvm format in the right direction?
Question 2: If "yes" to question 1, how do I get MLUtils working? If "no", what is the best way to convert a dataframe to libsvm format?
I would do it like this (it's just an example with an arbitrary dataframe; I don't know how your df1 is built, so the focus is on the data transformations).
This is my way to convert a dataframe to libsvm format:
# ... your previous imports
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint
# A DATAFRAME
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  3|  6|
|  4|  5| 20|
|  7|  8|  8|
+---+---+---+
# FROM DATAFRAME TO RDD
>>> c = df.rdd # this command converts your dataframe into an RDD
>>> print (c.take(3))
[Row(_1=1, _2=3, _3=6), Row(_1=4, _2=5, _3=20), Row(_1=7, _2=8, _3=8)]
# FROM RDD OF TUPLE TO A RDD OF LABELEDPOINT
>>> d = c.map(lambda line: LabeledPoint(line[0], list(line[1:]))) # arbitrary mapping, it's just an example: first column is the label, the rest are features
>>> print (d.take(3))
[LabeledPoint(1.0, [3.0,6.0]), LabeledPoint(4.0, [5.0,20.0]), LabeledPoint(7.0, [8.0,8.0])]
# SAVE AS LIBSVM
>>> MLUtils.saveAsLibSVMFile(d, "/your/Path/nameFolder/")
What you will see on the "/your/Path/nameFolder/part-0000*" files is:
1.0 1:3.0 2:6.0
4.0 1:5.0 2:20.0
7.0 1:8.0 2:8.0
See here for LabeledPoint docs
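For reference, the text layout of those part files can be reproduced in plain Python. This is a sketch of the line format only (label first, then 1-based index:value pairs), not of how MLUtils writes files; to_libsvm_line is a hypothetical helper name.

```python
def to_libsvm_line(label, features):
    """Hypothetical helper: format one row as 'label index:value ...'
    with 1-based feature indices, as in the output shown above."""
    pairs = [f"{i + 1}:{float(v)}" for i, v in enumerate(features)]
    return " ".join([str(float(label))] + pairs)


# Same rows as the example dataframe: first column is the label.
rows = [(1, 3, 6), (4, 5, 20), (7, 8, 8)]
lines = [to_libsvm_line(r[0], r[1:]) for r in rows]

for line in lines:
    print(line)
# 1.0 1:3.0 2:6.0
# 4.0 1:5.0 2:20.0
# 7.0 1:8.0 2:8.0
```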
I had to do this for it to work
D.map(lambda line: LabeledPoint(line[0],[line[1],line[2]]))
If you want to convert sparse vectors to a 'sparse' libsvm which is more efficient, try this:
from pyspark.ml.linalg import Vectors
from pyspark.mllib.linalg import Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
df = spark.createDataFrame([
(0, Vectors.sparse(5, [(1, 1.0), (3, 7.0)])),
(1, Vectors.sparse(5, [(1, 1.0), (3, 7.0)])),
(1, Vectors.sparse(5, [(1, 1.0), (3, 7.0)]))
], ["label", "features"])
df.show()
# +-----+-------------------+
# |label|           features|
# +-----+-------------------+
# |    0|(5,[1,3],[1.0,7.0])|
# |    1|(5,[1,3],[1.0,7.0])|
# |    1|(5,[1,3],[1.0,7.0])|
# +-----+-------------------+
MLUtils.saveAsLibSVMFile(df.rdd.map(lambda x: LabeledPoint(x.label, MLLibVectors.fromML(x.features))), './libsvm')

Find median in spark SQL for multiple double datatype columns

I have a requirement to find the median for multiple double datatype columns. Please suggest the correct approach.
Below is my sample dataset with one column. I am expecting the median value to be returned as 1 for my sample.
scala> sqlContext.sql("select num from test").show();
+---+
|num|
+---+
|0.0|
|0.0|
|1.0|
|1.0|
|1.0|
|1.0|
+---+
I tried the following options
1) Hive UDAF percentile: it worked only for BigInt.
2) Hive UDAF percentile_approx: it does not work as expected (returns 0.25 vs 1).
sqlContext.sql("select percentile_approx(num,0.5) from test").show();
+----+
| _c0|
+----+
|0.25|
+----+
3) Spark window function percent_rank: the way I see it, to find the median I would look for all percent_rank values above 0.5 and pick the num value corresponding to the max percent_rank. But this does not work in all cases, especially when I have an even number of records; in that case the median is the average of the two middle values in the sorted distribution.
Also, with percent_rank, as I have to find the median for multiple columns, I have to calculate it in different dataframes, which seems to me a rather complex method. Please correct me if my understanding is not right.
+---+-------------+
|num|percent_rank |
+---+-------------+
|0.0|          0.0|
|0.0|          0.0|
|1.0|          0.4|
|1.0|          0.4|
|1.0|          0.4|
|1.0|          0.4|
+---+-------------+
Out of curiosity, which version of Apache Spark are you using? There were some fixes in Apache Spark 2.0+ that included changes to approxQuantile.
If I were to run the PySpark code snippet below:
rdd = sc.parallelize([[1, 0.0], [1, 0.0], [1, 1.0], [1, 1.0], [1, 1.0], [1, 1.0]])
df = rdd.toDF(['id', 'num'])
df.createOrReplaceTempView("df")
with the median calculation using approxQuantile as:
df.approxQuantile("num", [0.5], 0.25)
or
spark.sql("select percentile_approx(num, 0.5) from df").show()
the results are:
Spark 2.0.0: 0.25
Spark 2.0.1: 1.0
Spark 2.1.0: 1.0
Note that these are approximate numbers (via approxQuantile), though in general this should work well. If you need the exact median, one approach is to use numpy.median. The code snippet below is updated for this df example based on gench's SO response to How to find the median in Apache Spark with Python Dataframe API?:
from pyspark.sql.types import *
import pyspark.sql.functions as F
import numpy as np
def find_median(values):
    try:
        median = np.median(values)  # get the median of the values in the list for each row
        return round(float(median), 2)
    except Exception:
        return None  # if there is anything wrong with the given values
median_finder = F.udf(find_median, FloatType())
df2 = df.groupBy("id").agg(F.collect_list("num").alias("nums"))
df2 = df2.withColumn("median", median_finder("nums"))
# print out
df2.show()
with the output of:
+---+--------------------+------+
| id|                nums|median|
+---+--------------------+------+
|  1|[0.0, 0.0, 1.0, 1...|   1.0|
+---+--------------------+------+
Updated: Spark 1.6 Scala version using RDDs
If you are using Spark 1.6, you can calculate the median using Scala code via Eugene Zhulenev's response How can I calculate the exact median with Apache Spark. Below is the modified code that works with our example.
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
val rdd: RDD[Double] = sc.parallelize(Seq((0.0), (0.0), (1.0), (1.0), (1.0), (1.0)))
val sorted = rdd.sortBy(identity).zipWithIndex().map {
  case (v, idx) => (idx, v)
}
val count = sorted.count()
val median: Double = if (count % 2 == 0) {
  val l = count / 2 - 1
  val r = l + 1
  (sorted.lookup(l).head + sorted.lookup(r).head).toDouble / 2
} else sorted.lookup(count / 2).head.toDouble
with the output of:
// output
import org.apache.spark.SparkContext._
rdd: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[227] at parallelize at <console>:34
sorted: org.apache.spark.rdd.RDD[(Long, Double)] = MapPartitionsRDD[234] at map at <console>:36
count: Long = 6
median: Double = 1.0
Note that this calculates the exact median using RDDs, i.e. you will need to convert the DataFrame column into an RDD to perform this calculation.
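The index arithmetic of the Scala snippet (sort, then take the middle element, or average the two middle elements when the count is even) can be checked locally in plain Python. This is only a sketch of the same logic on an in-memory list; exact_median is a hypothetical name, and the standard library's statistics.median is used as a cross-check.

```python
from statistics import median as stdlib_median


def exact_median(values):
    """Sort, then pick the middle value (odd count) or the average of
    the two middle values (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sequence is undefined")
    mid = n // 2
    if n % 2 == 0:
        # Even count: indices mid - 1 and mid are the l and r of the Scala code.
        return (ordered[mid - 1] + ordered[mid]) / 2
    return float(ordered[mid])


nums = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(exact_median(nums))          # 1.0
print(exact_median([1, 2, 3, 4]))  # 2.5
print(exact_median(nums) == stdlib_median(nums))  # True
```

For several columns the same function can simply be applied to each column's values, as long as each column fits in memory on one machine.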

Get first non-null values in group by (Spark 1.6)

How can I get the first non-null values from a group by? I tried using first with coalesce F.first(F.coalesce("code")) but I don't get the desired behavior (I seem to get the first row).
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([
("a", None, None),
("a", "code1", None),
("a", "code2", "name2"),
], ["id", "code", "name"])
I tried:
(df
.groupby("id")
.agg(F.first(F.coalesce("code")),
F.first(F.coalesce("name")))
.collect())
DESIRED OUTPUT
[Row(id='a', code='code1', name='name2')]
For Spark 1.3 - 1.5, this could do the trick:
from pyspark.sql import functions as F
df.groupBy(df['id']).agg(F.first(df['code']), F.first(df['name'])).show()
+---+-----------+-----------+
| id|FIRST(code)|FIRST(name)|
+---+-----------+-----------+
|  a|      code1|      name2|
+---+-----------+-----------+
Edit
Apparently, in version 1.6 they changed the way the first aggregate function is processed. Now, the underlying class First should be constructed with a second argument, ignoreNullsExpr, which is not yet used by the first aggregate function (as can be seen here). However, in Spark 2.0 it will be possible to call agg(F.first(col, True)) to ignore nulls (as can be checked here).
Therefore, for Spark 1.6 the approach must be different and, unfortunately, a little less efficient. One idea is the following:
from pyspark.sql import functions as F
df1 = df.select('id', 'code').filter(df['code'].isNotNull()).groupBy(df['id']).agg(F.first(df['code']))
df2 = df.select('id', 'name').filter(df['name'].isNotNull()).groupBy(df['id']).agg(F.first(df['name']))
result = df1.join(df2, 'id')
result.show()
+---+-------------+-------------+
| id|first(code)()|first(name)()|
+---+-------------+-------------+
|  a|        code1|        name2|
+---+-------------+-------------+
Maybe there is a better option. I'll edit the answer if I find one.
Because I only had one non-null value for every grouping, using min / max in 1.6 worked for my purposes:
(df
.groupby("id")
.agg(F.min("code"),
F.min("name"))
.show())
+---+---------+---------+
| id|min(code)|min(name)|
+---+---------+---------+
|  a|    code1|    name2|
+---+---------+---------+
The first function accepts an argument ignorenulls, which can be set to true.
Python:
df.groupby("id").agg(first(col("code"), ignorenulls=True).alias("code"))
Scala:
df.groupBy("id").agg(first(col("code"), ignoreNulls = true).alias("code"))
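What "first value, ignoring nulls" means per group can be sketched in plain Python on the rows from the question. This is an illustration of the semantics only: first_non_null is a hypothetical helper, and it assumes the rows are processed in the order listed, whereas Spark gives no ordering guarantee for first unless the data is explicitly ordered.

```python
def first_non_null(rows, key, columns):
    """Hypothetical helper: for each group, keep the first non-None
    value seen for every requested column."""
    result = {}
    for row in rows:
        group = result.setdefault(row[key], {c: None for c in columns})
        for c in columns:
            if group[c] is None and row[c] is not None:
                group[c] = row[c]
    return result


rows = [
    {"id": "a", "code": None, "name": None},
    {"id": "a", "code": "code1", "name": None},
    {"id": "a", "code": "code2", "name": "name2"},
]

print(first_non_null(rows, "id", ["code", "name"]))
# {'a': {'code': 'code1', 'name': 'name2'}}
```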
