SparkSQL Aggregator: MissingRequirementError - apache-spark

I am trying to use Apache Spark 2.0 Datasets:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import spark.implicits._
case class C1(f1: String, f2: String, f3: String, f4: String, f5: Double)
val teams = Seq(
C1("hash1", "NLC", "Cubs", "2016-01-23", 3253.21),
C1("hash1", "NLC", "Cubs", "2014-01-23", 353.88),
C1("hash3", "NLW", "Dodgers", "2013-08-15", 4322.12),
C1("hash4", "NLE", "Red Sox", "2010-03-14", 10283.72)
).toDS()
val c1Agg = new Aggregator[C1, Seq[C1], Seq[C1]] with Serializable {
def zero: Seq[C1] = Seq.empty[C1] //Nil
def reduce(b: Seq[C1], a: C1): Seq[C1] = b :+ a
def merge(b1: Seq[C1], b2: Seq[C1]): Seq[C1] = b1 ++ b2
def finish(r: Seq[C1]): Seq[C1] = r
override def bufferEncoder: Encoder[Seq[C1]] = newProductSeqEncoder[C1]
override def outputEncoder: Encoder[Seq[C1]] = newProductSeqEncoder[C1]
}.toColumn
val g_c1 = teams.groupByKey(_.f1).agg(c1Agg).collect
But when I run it I get the following error message:
scala.reflect.internal.MissingRequirementError: class lineb4c2bb72bf6e417e9975d1a65602aec912.$read in JavaMirror with sun.misc.Launcher$AppClassLoader#14dad5dc of type class sun.misc.Launcher$AppClassLoader with class path [OMITTED] not found
I assume the configuration is correct because I am running on Databricks Community Cloud.

I was finally able to make it work by using ExpressionEncoder() rather than newProductSeqEncoder[C1] in the bufferEncoder and outputEncoder definitions.
(I'm not sure why the previous code did not work, though.)
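For reference, here is the aggregator with that change applied (a minimal sketch; everything else stays as in the snippet above):

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

val c1Agg = new Aggregator[C1, Seq[C1], Seq[C1]] with Serializable {
  def zero: Seq[C1] = Seq.empty[C1]
  def reduce(b: Seq[C1], a: C1): Seq[C1] = b :+ a
  def merge(b1: Seq[C1], b2: Seq[C1]): Seq[C1] = b1 ++ b2
  def finish(r: Seq[C1]): Seq[C1] = r
  // ExpressionEncoder() derives the Seq[C1] encoder via reflection;
  // this is what made the MissingRequirementError go away in my case
  override def bufferEncoder: Encoder[Seq[C1]] = ExpressionEncoder()
  override def outputEncoder: Encoder[Seq[C1]] = ExpressionEncoder()
}.toColumn

val g_c1 = teams.groupByKey(_.f1).agg(c1Agg).collect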

Related

How to use Aggregator with a Row input in Spark SQL?

import org.apache.spark.sql.{Encoder, Encoders, Row, functions}
import org.apache.spark.sql.expressions.Aggregator

val testAgg = new Aggregator[Row, Int, Int] {
def zero = 0
def reduce(buffer: Int, row: Row):Int = buffer + 1
def merge(b1: Int, b2: Int):Int = b1 + b2
def finish(b: Int): Int = b
def bufferEncoder: Encoder[Int] = Encoders.scalaInt
def outputEncoder: Encoder[Int] = Encoders.scalaInt
}
I know how to use this aggregator with a DataFrame, but how do I use it in SQL?
spark.udf.register("testAgg", functions.udaf(testAgg))
spark.sql("SELECT testAgg(*) from mytable2").show() // not work
spark.sql("SELECT testAgg(struct(*)) from mytable2").show() // not work either
Error: No applicable constructor/method found for zero actual parameters; candidates are: "public org.apache.spark.sql.Row org.apache.spark.sql.Row$.apply(scala.collection.Seq)"
So is it possible to use an Aggregator with Row as input in SQL?
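A sketch of one possible approach, assuming Spark 3.x: register the Aggregator with an explicit Row encoder that matches the struct(*) being passed in, instead of relying on the implicit encoder derivation that produces the error above (on newer versions Encoders.row(schema) can be used instead of RowEncoder(schema)):

import org.apache.spark.sql.{Encoder, Encoders, Row, functions}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.expressions.Aggregator

// The encoder has to match the schema of the struct(*) that SQL will pass in
val rowSchema = spark.table("mytable2").schema

val testAgg = new Aggregator[Row, Int, Int] {
  def zero = 0
  def reduce(buffer: Int, row: Row): Int = buffer + 1
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(b: Int): Int = b
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

// Register with an explicit input encoder for Row
spark.udf.register("testAgg", functions.udaf(testAgg, RowEncoder(rowSchema)))
spark.sql("SELECT testAgg(struct(*)) FROM mytable2").show()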

Get name / alias of column in PySpark

I am defining a column object like this:
column = F.col('foo').alias('bar')
I know I can get the full expression using str(column).
But how can I get the column's alias only?
In the example, I'm looking for a function get_column_name where get_column_name(column) returns the string bar.
One way is through regular expressions:
from pyspark.sql.functions import col
column = col('foo').alias('bar')
print(column)
#Column<foo AS `bar`>
import re
print(re.findall(r"(?<=AS `)\w+(?=`>$)", str(column))[0])
#'bar'
Alternatively, we could use a wrapper function to tweak the behavior of the Column.alias and Column.name methods so that they store the alias in an AS attribute:
from pyspark.sql import Column, SparkSession
from pyspark.sql.functions import col, explode, array, struct, lit

SparkSession.builder.getOrCreate()

def alias_wrapper(self, *alias, **kwargs):
    renamed_col = Column._alias(self, *alias, **kwargs)
    renamed_col.AS = alias[0] if len(alias) == 1 else alias
    return renamed_col

Column._alias, Column.alias, Column.name, Column.AS = Column.alias, alias_wrapper, alias_wrapper, None
which then guarantees:
assert(col("foo").alias("bar").AS == "bar")
# `name` should act like `alias`
assert(col("foo").name("bar").AS == "bar")
# column without alias should have None in `AS`
assert(col("foo").AS is None)
# multialias should be handled
assert(explode(array(struct(lit(1), lit("a")))).alias("foo", "bar").AS == ("foo", "bar"))
Regex is not needed. In PySpark 3.x the backticks appear to have been replaced with quotes, so this might not work out of the box on earlier Spark versions, but it should be easy enough to modify.
from pyspark.sql import Column

def get_column_name(col: Column) -> str:
    """
    PySpark doesn't allow you to directly access the column name with respect to aliases
    from an unbound column. We have to parse this out from the string representation.
    This works on columns with one or more aliases as well as unaliased columns.
    Returns:
        Col name as str, with respect to aliasing
    """
    c = str(col).removeprefix("Column<'").removesuffix("'>")
    return c.split(' AS ')[-1]
Some tests to validate behavior:
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # Provide a spark session fixture for all tests
    yield SparkSession.builder.getOrCreate()

def test_get_col_name(spark):
    col = f.col('a')
    actual = get_column_name(col)
    assert actual == 'a'

def test_get_col_name_alias(spark):
    col = f.col('a').alias('b')
    actual = get_column_name(col)
    assert actual == 'b'

def test_get_col_name_multiple_alias(spark):
    col = f.col('a').alias('b').alias('c')
    actual = get_column_name(col)
    assert actual == 'c'

def test_get_col_name_longer(spark: SparkSession):
    """Added this test due to identifying a bug in the old implementation (if you use lstrip/rstrip, this will fail)."""
    col = f.col("local")
    actual = get_column_name(col)
    assert actual == "local"
I've noticed that in some systems you may have backticks surrounding column names. The following options work both with backticks and without.
Option 1 (no regex): str(col).replace("`", "").split("'")[-2].split(" AS ")[-1]
from pyspark.sql.functions import col
col_1 = col('foo')
col_2 = col('foo').alias('bar')
col_3 = col('foo').alias('bar').alias('baz')
s = str(col_1)
print(col_1)
print(s.replace("`", "").split("'")[-2].split(" AS ")[-1])
# Column<'foo'>
# foo
s = str(col_2)
print(col_2)
print(s.replace("`", "").split("'")[-2].split(" AS ")[-1])
# Column<'foo AS bar'>
# bar
s = str(col_3)
print(col_3)
print(s.replace("`", "").split("'")[-2].split(" AS ")[-1])
# Column<'foo AS bar AS baz'>
# baz
Option 2 (regex): pattern '.*?`?(\w+)`?' looks safe enough:
re.search(r"'.*?`?(\w+)`?'", str(col)).group(1)
from pyspark.sql.functions import col
col_1 = col('foo')
col_2 = col('foo').alias('bar')
col_3 = col('foo').alias('bar').alias('baz')
import re
print(col_1)
print(re.search(r"'.*?`?(\w+)`?'", str(col_1)).group(1))
# Column<'foo'>
# foo
print(col_2)
print(re.search(r"'.*?`?(\w+)`?'", str(col_2)).group(1))
# Column<'foo AS bar'>
# bar
print(col_3)
print(re.search(r"'.*?`?(\w+)`?'", str(col_3)).group(1))
# Column<'foo AS bar AS baz'>
# baz

Spark custom aggregation : collect_list+UDF vs UDAF

I often need to perform custom aggregations on DataFrames in Spark 2.1, and I have used these two approaches:
Using groupBy/collect_list to get all the values into a single row, then applying a UDF to aggregate the values
Writing a custom UDAF (user-defined aggregate function)
I generally prefer the first option as it's easier to implement and more readable than the UDAF implementation. But I would assume that the first option is generally slower, because more data is sent around the network (no partial aggregation), yet my experience shows that UDAFs are generally slow. Why is that?
Concrete example: calculating a histogram.
The data is in a Hive table (1E6 random double values):
val df = spark.table("testtable")
def roundToMultiple(d:Double,multiple:Double) = Math.round(d/multiple)*multiple
UDF approach:
val udf_histo = udf((xs:Seq[Double]) => xs.groupBy(x => roundToMultiple(x,0.25)).mapValues(_.size))
df.groupBy().agg(collect_list($"x").as("xs")).select(udf_histo($"xs")).show(false)
+--------------------------------------------------------------------------------+
|UDF(xs) |
+--------------------------------------------------------------------------------+
|Map(0.0 -> 125122, 1.0 -> 124772, 0.75 -> 250819, 0.5 -> 248696, 0.25 -> 250591)|
+--------------------------------------------------------------------------------+
UDAF approach:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import scala.collection.mutable
class HistoUDAF(binWidth:Double) extends UserDefinedAggregateFunction {
override def inputSchema: StructType =
StructType(
StructField("value", DoubleType) :: Nil
)
override def bufferSchema: StructType =
new StructType()
.add("histo", MapType(DoubleType, IntegerType))
override def deterministic: Boolean = true
override def dataType: DataType = MapType(DoubleType, IntegerType)
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = Map[Double, Int]()
}
private def mergeMaps(a: Map[Double, Int], b: Map[Double, Int]) = {
a ++ b.map { case (k,v) => k -> (v + a.getOrElse(k, 0)) }
}
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val oldBuffer = buffer.getAs[Map[Double, Int]](0)
val newInput = Map(roundToMultiple(input.getDouble(0),binWidth) -> 1)
buffer(0) = mergeMaps(oldBuffer, newInput)
}
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
val a = buffer1.getAs[Map[Double, Int]](0)
val b = buffer2.getAs[Map[Double, Int]](0)
buffer1(0) = mergeMaps(a, b)
}
override def evaluate(buffer: Row): Any = {
buffer.getAs[Map[Double, Int]](0)
}
}
val histo = new HistoUDAF(0.25)
df.groupBy().agg(histo($"x")).show(false)
+--------------------------------------------------------------------------------+
|histoudaf(x) |
+--------------------------------------------------------------------------------+
|Map(0.0 -> 125122, 1.0 -> 124772, 0.75 -> 250819, 0.5 -> 248696, 0.25 -> 250591)|
+--------------------------------------------------------------------------------+
My tests show that the collect_list/UDF approach is about 2 times faster than the UDAF approach. Is this a general rule, or are there cases where a UDAF is really much faster and the rather awkward implementation is justified?
A UDAF is slower because it deserializes/serializes the aggregation buffer from/to the internal buffer representation on every update, i.e. on every row, which is quite expensive. Instead you should use an Aggregator (in fact, UserDefinedAggregateFunction has been deprecated since Spark 3.0).
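For comparison, here is a sketch of the same histogram written as an Aggregator and registered through functions.udaf (Spark 3.x; the encoder choices are an assumption, any suitable encoder for Map[Double, Int] works):

import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.{col, udaf}

class HistoAgg(binWidth: Double) extends Aggregator[Double, Map[Double, Int], Map[Double, Int]] {
  def zero: Map[Double, Int] = Map.empty
  def reduce(buf: Map[Double, Int], x: Double): Map[Double, Int] = {
    val bin = Math.round(x / binWidth) * binWidth
    buf + (bin -> (buf.getOrElse(bin, 0) + 1))
  }
  def merge(b1: Map[Double, Int], b2: Map[Double, Int]): Map[Double, Int] =
    b1 ++ b2.map { case (k, v) => k -> (v + b1.getOrElse(k, 0)) }
  def finish(b: Map[Double, Int]): Map[Double, Int] = b
  // The buffer stays a plain Scala Map between updates and is only encoded when
  // partial aggregates are exchanged, which is why this tends to beat the UDAF above
  def bufferEncoder: Encoder[Map[Double, Int]] = ExpressionEncoder()
  def outputEncoder: Encoder[Map[Double, Int]] = ExpressionEncoder()
}

val histoAgg = udaf(new HistoAgg(0.25))
df.groupBy().agg(histoAgg(col("x"))).show(false)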

How to search for a substring within a string using PySpark

(The original question included an image with sample data: a tab-separated file with a Name column and a Text column holding a paragraph about each person.)
For example, if a sentence contains "John" and "drives", it means John has a car and drives to work. I'm attaching the code I'm using to do this. However, the code doesn't work correctly and is too complicated. I would appreciate your help.
%pyspark
rdd = sc.textFile("./sample.txt")
col = rdd.map(lambda line: line.split('\t'))
#remove header
header = col.first() #extract header
col = col.filter(lambda line: line != header)
def convertToRow(line):
    return Row(Name = line[0], Text = line[1])
#call the function on each row, then convert to dataframe
df = col.map(convertToRow).toDF()
from pyspark.sql.functions import udf
def splitParagraphIntoSentences(paragraph):
    sentences = nltk.tokenize.sent_tokenize(paragraph)
    return sentences

def tokenize(text):
    text = text.lower().replace('\n', '')
    text = re.sub(',', '', text)
    tokens = text.split()
    if(len(tokens) > 1):
        tokens = splitParagraphIntoSentences(text)
    return tokens

tokenize = udf(lambda text: tokenize(text))
data = df.select('Name', tokenize(df.Text).alias("Text"))
def how(name, paragraph):
    drive = ['drives']
    walks = ['walks']
    comingwith = ['coming with']
    for s in paragraph:
        s = s.split()
        if ((any(s[i:i+len(drive)]==drive for i in xrange(len(s)-len(drive)+1))) and (any(s[i:i+len(name)]==name for i in xrange(len(s)-len(name)+1)))):
            return "Drives"
        elif ((any(s[i:i+len(walks)]==walks for i in xrange(len(s)-len(walks)+1))) and (any(s[i:i+len(name)]==name for i in xrange(len(s)-len(name)+1)))):
            return "Walks"
        elif ((any(s[i:i+len(comingwith)]==comingwith for i in xrange(len(s)-len(comingwith)+1))) and (any(s[i:i+len(name)]==name for i in xrange(len(s)-len(name)+1)))):
            return "Coming with"

def checkYesNo(name, paragraph):
    drive = ['drives']
    walks = ['walks']
    comingwith = ['coming with']
    for s in paragraph:
        s = s.split()
        if ((any(s[i:i+len(comingwith)]==comingwith for i in xrange(len(s)-len(comingwith)+1))) or (any(s[i:i+len(walks)]==walks for i in xrange(len(s)-len(walks)+1)))):
            return "No"
        else:
            return "Yes"

how = udf(lambda name, paragraph: how(name, paragraph))
checkYesNo = udf(lambda name, paragraph: checkYesNo(name, paragraph))
final_df = data.select('Name', checkYesNo(data.Name, data.Text), how(data.Name, data.Text))
I'd do it like this:
import os
import socket

class SparkUtil(object):
    @staticmethod
    def get_spark_context(host, venv, framework_name, parts):
        os.environ['PYSPARK_PYTHON'] = "{0}/bin/python".format(venv)
        from pyspark import SparkConf, SparkContext
        from StringIO import StringIO
        ip = socket.gethostbyname(socket.gethostname())
        sparkConf = (SparkConf()
                     .setMaster(host)
                     .setAppName(framework_name))
        return SparkContext(conf = sparkConf)

input_txt = [
    [ "John", "John usually drives to work. He usually gets up early and drinks coffee. Mary usually joining him." ],
    [ "Sam", "As opposed to John, Sam doesn't like to drive. Sam usually walks there." ],
    [ "Mary", "Mary doesn't have driving license. Mary usually coming with John which picks her up from home." ]
]

def has_car(text):
    return "drives" in text

def get_method(text):
    method = None
    for m in [ "drives", "walks", "coming with" ]:
        if m in text:
            method = m
            break
    return method

def process_row(row):
    return [ row[0], has_car(row[1]), get_method(row[1]) ]

sc = SparkUtil.get_spark_context(host = "local[2]",
                                 venv = "../starshome/venv",
                                 framework_name = "app",
                                 parts = 2)
print(sc.parallelize(input_txt).map(process_row).collect())
You can probably ignore the SparkUtil class; I'm not using a notebook, this is just a straight-up Spark app.

Compounding in Spark

I have a dataframe in this format:
Date | Return
01/01/2015 0.0
02/02/2015 -0.02
03/02/2015 0.05
04/02/2015 0.07
I would like to do compounding and add a column with the compounded return, which is calculated as:
1 for the first row,
(1 + Return(i)) * Compounded(i-1) for the following rows.
So my df will finally be:
Date | Return | Compounded
01/01/2015 0.0 1.0
02/02/2015 -0.02 1.0*(1-0.02)=0.98
03/02/2015 0.05 0.98*(1+0.05)=1.029
04/02/2015 0.07 1.029*(1+0.07)=1.10103
Answers in Java would be highly appreciated.
You can also create a custom aggregate function and use it in a window function.
Something like this (written freeform, so there are probably some mistakes):
package com.myuadfs
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
class MyUDAF() extends UserDefinedAggregateFunction {
def inputSchema: StructType = StructType(Array(StructField("Return", DoubleType)))
def bufferSchema: StructType = StructType(Array(StructField("compounded", DoubleType)))
def dataType: DataType = DoubleType
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer(0) = 1.0 // set compounded to 1
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
buffer(0) = buffer.getDouble(0) * ( input.getDouble(0) + 1)
}
// this generally merges two aggregated buffers. This means this
// would not have worked properly had you been working with a regular
// aggregate but since you are planning to use this inside a window
// only this should not be called at all.
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
}
def evaluate(buffer: Row) = {
buffer.getDouble(0)
}
}
Now you can use this inside a window function. Something like this:
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.orderBy("date")
val myUDAF = new MyUDAF()
val newDF = df.withColumn("compounded", myUDAF(df("Return")).over(windowSpec))
Note that this has the limitation that the entire calculation has to fit in a single partition, so if your data is too large you will have a problem. That said, operations like this are nominally performed after some partitioning by key (e.g. adding a partitionBy to the window), in which case only a single key's data has to fit in a partition.
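As an aside, when all returns are greater than -1 the cumulative product can also be expressed with built-in window functions by summing logarithms, which sidesteps the custom UDAF entirely. A sketch, with column names taken from the question:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, exp, log, sum}

val w = Window.orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow)

// Compounded(i) = prod_{j <= i} (1 + Return(j)) = exp( sum_{j <= i} log(1 + Return(j)) )
val withCompounded = df.withColumn("Compounded", exp(sum(log(col("Return") + 1)).over(w)))

The same single-partition caveat applies here unless a partitionBy is added to the window.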
First, we define a function f(line) (suggest a better name, please!!) to process the lines.
def f(line):
    global firstLine
    global last_compounded
    if line[0] == 'Date':
        firstLine = True
        return (line[0], line[1], 'Compounded')
    else:
        firstLine = False
    if firstLine:
        last_compounded = 1
        firstLine = False
    else:
        last_compounded = (1+float(line[1]))*last_compounded
    return (line[0], line[1], last_compounded)
Using two global variables (could this be improved?), we keep track of the Compounded(i-1) value and of whether we are processing the first line.
With your data in some_file, a solution could be:
rdd = sc.textFile('some_file').map(lambda l: l.split())
r1 = rdd.map(lambda l: f(l))
rdd.collect()
[[u'Date', u'Return'], [u'01/01/2015', u'0.0'], [u'02/02/2015', u'-0.02'], [u'03/02/2015', u'0.05'], [u'04/02/2015', u'0.07']]
r1.collect()
[(u'Date', u'Return', 'Compounded'), (u'01/01/2015', u'0.0', 1.0), (u'02/02/2015', u'-0.02', 0.98), (u'03/02/2015', u'0.05', 1.05), (u'04/02/2015', u'0.07', 1.1235000000000002)]

Resources