Spark SQL - Aggregate collections? - apache-spark

Let's say I have 2 data frames.
DF1 may have values {3, 4, 5} in column A of various rows.
DF2 may have values {4, 5, 6} in column A of various rows.
I can aggregate these into a set of distinct elements using collect_set(A), assuming all those rows fall into the same grouping.
At this point I have a set in the resulting data frame. Is there any way to aggregate that set with another set? Basically, if I have 2 data frames resulting from the first aggregation, I want to be able to aggregate their results.

While explode and collect_set could solve this, it made more sense just to write a custom aggregator to merge the sets themselves. The structure underlying them is a WrappedArray.
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
case class SetMergeUDAF() extends UserDefinedAggregateFunction {
  // The order of the elements in the result is not guaranteed.
  def deterministic: Boolean = false
  def inputSchema: StructType = StructType(StructField("input", ArrayType(LongType)) :: Nil)
  def bufferSchema: StructType = StructType(StructField("buffer", ArrayType(LongType)) :: Nil)
  def dataType: DataType = ArrayType(LongType)
  // Start every group with an empty array.
  def initialize(buf: MutableAggregationBuffer): Unit = {
    buf(0) = Seq.empty[Long]
  }
  // Fold one input array into the buffer, dropping duplicates.
  // Spark hands array columns back as a WrappedArray, which is a Seq.
  def update(buf: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buf(0) = union(buf.getAs[Seq[Long]](0), input.getAs[Seq[Long]](0))
    }
  }
  // Combine two partial buffers in the same way.
  def merge(buf1: MutableAggregationBuffer, buf2: Row): Unit = {
    buf1(0) = union(buf1.getAs[Seq[Long]](0), buf2.getAs[Seq[Long]](0))
  }
  def evaluate(buf: Row): Any = buf.getAs[Seq[Long]](0)
  private def union(a: Seq[Long], b: Seq[Long]): Seq[Long] = (a.toSet ++ b).toSeq
}
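The aggregator's semantics can be checked outside Spark. What follows is a minimal plain-Python sketch, not Spark code: update_buf and merge_buf are hypothetical stand-ins for the UDAF's update and merge methods, and the two partitions are made-up rows built from the sets in the question. It is only meant to illustrate that the result does not depend on how the rows are split across partitions.

```python
from functools import reduce

def update_buf(buf, row):
    """Fold one input array into the buffer; null rows are skipped, as in update()."""
    return buf if row is None else buf | set(row)

def merge_buf(buf1, buf2):
    """Combine two partial buffers, as in merge()."""
    return buf1 | buf2

# Rows holding already-aggregated sets, split across two hypothetical partitions.
partition1 = [[3, 4, 5], None]
partition2 = [[4, 5, 6], [6]]

# Each partition is reduced on its own, then the partial results are merged.
partials = [reduce(update_buf, part, set()) for part in (partition1, partition2)]
merged = sorted(reduce(merge_buf, partials, set()))
print(merged)  # [3, 4, 5, 6]
```

Because set union is associative and commutative, reducing all rows in a single pass gives the same elements as merging per-partition results.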

Related

How to pass an array to a User Defined Aggregation Function (UDAF) in Spark

I'd like to pass an Array as the input schema of a UDAF.
The example I give is pretty simple: it just sums 2 vectors. My actual use case is more complex, and I need to use a UDAF.
import spark.implicits._ // assumes a SparkSession named spark
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions._
val df = Seq(
  (1, Array(10.2, 12.3, 11.2)),
  (1, Array(11.2, 12.6, 10.8)),
  (2, Array(12.1, 11.2, 10.1)),
  (2, Array(10.1, 16.0, 9.3))
).toDF("siteId", "bidRevenue")
class BidAggregatorBySiteId() extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(Array(StructField("bidRevenue", ArrayType(DoubleType))))
  def bufferSchema = StructType(Array(StructField("sumArray", ArrayType(DoubleType))))
  def dataType: DataType = ArrayType(DoubleType)
  def deterministic = true
  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, Array(0.0, 0.0, 0.0))
  }
  def update(buffer: MutableAggregationBuffer, input: Row) = {
    val seqBuffer = buffer(0).asInstanceOf[IndexedSeq[Double]]
    val seqInput = input(0).asInstanceOf[IndexedSeq[Double]]
    buffer(0) = seqBuffer.zip(seqInput).map{ case (x, y) => x + y }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    val seqBuffer1 = buffer1(0).asInstanceOf[IndexedSeq[Double]]
    val seqBuffer2 = buffer2(0).asInstanceOf[IndexedSeq[Double]]
    buffer1(0) = seqBuffer1.zip(seqBuffer2).map{ case (x, y) => x + y }
  }
  def evaluate(buffer: Row) = {
    buffer
  }
}
val fun = new BidAggregatorBySiteId()
df.select($"siteId", $"bidRevenue" cast(ArrayType(DoubleType)))
  .groupBy("siteId").agg(fun($"bidRevenue"))
  .show
Everything works for the transformations before the "show" action, but show raises this error:
scala.MatchError: [WrappedArray(21.4, 24.9, 22.0)] (of class org.apache.spark.sql.execution.aggregate.InputAggregationBuffer)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$ArrayConverter.toCatalystImpl(CatalystTypeConverters.scala:160)
The structure of my DataFrame is:
root
|-- siteId: integer (nullable = false)
|-- bidRevenue: array (nullable = true)
| |-- element: double (containsNull = true)
df.dtypes = Array[(String, String)] = Array(("siteId", "IntegerType"), ("bidRevenue", "ArrayType(DoubleType,true)"))
Thanks for your valuable help.
def evaluate(buffer: Row): Any
This method is called once a group has been processed completely, to produce the final result.
Because you initialize and update only index 0 of the buffer, i.e. buffer(0), the aggregated result is stored there. evaluate therefore has to return the value at index 0 rather than the whole buffer Row:
def evaluate(buffer: Row) = {
  buffer.get(0)
}
With this change to evaluate(), show prints:
// +------+---------------------------------+
// |siteId|bidaggregatorbysiteid(bidRevenue)|
// +------+---------------------------------+
// |     1|               [21.4, 24.9, 22.0]|
// |     2|               [22.2, 27.2, 19.4]|
// +------+---------------------------------+
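The role of each UDAF method, and why evaluate has to unwrap slot 0, can be sketched in plain Python. This is an illustrative model only, not Spark's API: the buffer is a hypothetical one-element list standing in for the aggregation buffer Row, and the "one partition per row" split is an assumption made to exercise merge.

```python
def initialize():
    # One-slot buffer, mirroring the single field declared in bufferSchema.
    return [[0.0, 0.0, 0.0]]

def update(buffer, row):
    # Add one input vector to the running sum held in slot 0.
    buffer[0] = [x + y for x, y in zip(buffer[0], row)]

def merge(buffer1, buffer2):
    # Add another partial sum into buffer1.
    buffer1[0] = [x + y for x, y in zip(buffer1[0], buffer2[0])]

def evaluate(buffer):
    # Return the value stored in slot 0, not the buffer that wraps it.
    return buffer[0]

rows = {
    1: [[10.2, 12.3, 11.2], [11.2, 12.6, 10.8]],
    2: [[12.1, 11.2, 10.1], [10.1, 16.0, 9.3]],
}
result = {}
for site_id, vectors in rows.items():
    # Pretend each row was aggregated on its own partition, then merged.
    partials = []
    for vec in vectors:
        buf = initialize()
        update(buf, vec)
        partials.append(buf)
    total = initialize()
    for part in partials:
        merge(total, part)
    result[site_id] = [round(v, 1) for v in evaluate(total)]
print(result)
```

Returning buffer instead of buffer[0] from evaluate would hand back the wrapper (a list containing the vector) where a plain vector is expected, which is the Python analogue of the MatchError in the question.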

How to merge two dataframes and return data from another column in new column only if there is match?

I have two DataFrames that look like this:
df1:
id
1
2
df2:
id value
2 a
3 b
How do I merge these two dataframes and only return the data from value column in a new column if there is a match?
new_merged_df
id value new_value
1
2 a a
3 b
You can try this using @JJFord3's setup:
import pandas
df1 = pandas.DataFrame(index=[1,2])
df2 = pandas.DataFrame({'value' : ['a','b']},index=[2,3])
#Use isin to create new_value
df2['new_value'] = df2['value'].where(df2.index.isin(df1.index))
#Use reindex with union to rebuild dataframe with both indexes
df2.reindex(df1.index.union(df2.index))
Output:
value new_value
1 NaN NaN
2 a a
3 b NaN
import pandas
df1 = pandas.DataFrame(index=[1,2])
df2 = pandas.DataFrame({'value' : ['a','b']},index=[2,3])
new_merged_df_outer = df1.merge(df2,how='outer',left_index=True,right_index=True)
new_merged_df_inner = df1.merge(df2,how='inner',left_index=True,right_index=True)
# rename returns a new DataFrame, so the result has to be assigned back
new_merged_df_inner = new_merged_df_inner.rename(columns={'value':'new_value'})
new_merged_df = new_merged_df_outer.merge(new_merged_df_inner,how='left',left_index=True,right_index=True)
First, create an outer merge to keep all indexes.
Then create an inner merge to only get the overlap.
Then merge the inner merge back to the outer merge to get the desired column setup.
You can use a full outer join.
Let's model your data with case classes:
case class MyClass1(id: String)
case class MyClass2(id: String, value: String)
// this one is for the result type
case class MyClass3(id: String, value: Option[String] = None, value2: Option[String] = None)
Creating some inputs:
val input1: Dataset[MyClass1] = ...
val input2: Dataset[MyClass2] = ...
Joining your data:
import spark.implicits._ // assumes a SparkSession named spark
val joined = input1.as("1").joinWith(input2.as("2"), $"1.id" === $"2.id", "full_outer")
joined map {
  case (left, null) if left != null => MyClass3(left.id)
  case (null, right) if right != null => MyClass3(right.id, Some(right.value))
  case (left, right) => MyClass3(left.id, Some(right.value), Some(right.value))
}
DataFrame.merge has an indicator parameter. From the documentation:
If True, adds a column to output DataFrame called “_merge” with information on the source of each row.
This can be used to check whether there is a match:
import pandas as pd
df1 = pd.DataFrame(index=[1,2])
df2 = pd.DataFrame({'value' : ['a','b']},index=[2,3])
# creates a new column `_merge` with values `right_only`, `left_only` or `both`
merged = df1.merge(df2, how='outer', right_index=True, left_index=True, indicator=True)
merged['new_value'] = merged.loc[(merged['_merge'] == 'both'), 'value']
merged = merged.drop('_merge', axis=1)
Use merge and isin:
df = df1.merge(df2,on='id',how='outer')
id_value = df2.loc[df2['id'].isin(df1.id.tolist()),'id'].unique()
mask = df['id'].isin(id_value)
df.loc[mask,'new_value'] = df.loc[mask,'value']
# alternative df['new_value'] = np.where(mask, df['value'], np.nan)
print(df)
id value new_value
0 1 NaN NaN
1 2 a a
2 3 b NaN
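All of the answers above implement the same rule: keep every id from both frames, and copy value into new_value only for ids present on both sides. A minimal plain-Python sketch of that rule follows. It uses no pandas; merge_with_match is a hypothetical helper name, and None plays the role of NaN.

```python
df1_ids = [1, 2]
df2_rows = {2: "a", 3: "b"}  # id -> value

def merge_with_match(left_ids, right):
    """Outer-join on id; copy value into new_value only where both sides have the id."""
    merged = []
    for key in sorted(set(left_ids) | set(right)):
        value = right.get(key)                      # None plays the role of NaN
        in_both = key in left_ids and key in right  # the "_merge == 'both'" case
        merged.append({"id": key, "value": value, "new_value": value if in_both else None})
    return merged

for row in merge_with_match(df1_ids, df2_rows):
    print(row)
# {'id': 1, 'value': None, 'new_value': None}
# {'id': 2, 'value': 'a', 'new_value': 'a'}
# {'id': 3, 'value': 'b', 'new_value': None}
```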

Iterative RDD/Dataframe processing in Spark

My ADLA solution is being transitioned to Spark. I'm trying to find the right replacement for the U-SQL REDUCE expression, which lets me:
Read logical partition and store information in a list/dictionary/vector or other data structure in memory
Apply logic that requires multiple iterations
Output results as additional columns together with the original data (the original rows might be partially eliminated or duplicated)
Example of possible task:
Input dataset has sales and return transactions with their IDs and attributes
The solution should find the most likely sale for each return
Return transaction must happen after the sales transaction and be as similar to the sales transactions as possible (best available match)
Return transaction must be linked to exactly one sales transaction; sales transaction could be linked to one or no return transaction - link is supposed to be captured in the new column LinkedTransactionId
The solution could probably be achieved with the groupByKey command, but I can't work out how to apply the logic across multiple rows. All the examples I've managed to find are some variation of an in-line function (usually an aggregate, e.g. .map(t => (t._1, t._2.sum))) that doesn't need information about the individual records in the same partition.
Can anyone share an example of a similar solution or point me in the right direction?
Here is one possible solution. Feedback, suggestions for a different approach, and other examples of iterative Spark/Scala solutions are greatly appreciated:
The example reads Sales and Credit transactions for each customer (CustomerId) and is meant to process each customer as a separate partition (the outer mapPartitions loop). Note that repartition(2, $"CustomerId") keeps all of a customer's rows in one partition but can also place several customers in the same partition, so the pairing is per-customer only when each partition holds a single customer.
Each Credit is mapped to the Sales transaction with the closest score (i.e. the smallest score difference), using the inner foreach loops inside each partition
The mutable map trnMap prevents a transaction from being assigned twice and captures the updates made by the process
Results are returned through an iterator into the final dataset dfOut2
Note: in this particular case the same result could have been achieved with window functions, without an iterative solution, but the purpose is to test the iterative logic itself.
import org.apache.spark.sql.SparkSession
import org.apache.spark._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.api.java.JavaRDD
case class Person(name: String, var age: Int)
case class SalesTransaction(
  CustomerId : Int,
  TransactionId : Int,
  Score : Int,
  Revenue : Double,
  Type : String,
  Credited : Double = 0.0,
  LinkedTransactionId : Int = 0,
  IsProcessed : Boolean = false
)
case class TransactionScore(
  TransactionId : Int,
  Score : Int
)
case class TransactionPair(
  SalesId : Int,
  CreditId : Int,
  ScoreDiff : Int
)
object ExampleDataFramePartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Example Combiner")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()
    import spark.implicits._
    val df = Seq(
      (1, 1, 123, "Sales", 100),
      (1, 2, 122, "Credit", 100),
      (1, 3, 99, "Sales", 70),
      (1, 4, 101, "Sales", 77),
      (1, 5, 102, "Credit", 75),
      (1, 6, 98, "Sales", 71),
      (2, 7, 200, "Sales", 55),
      (2, 8, 220, "Sales", 55),
      (2, 9, 200, "Credit", 50),
      (2, 10, 205, "Sales", 50)
    ).toDF("CustomerId", "TransactionId", "TransactionAttributesScore", "TransactionType", "Revenue")
      .withColumn("Revenue", $"Revenue".cast(DoubleType))
      .repartition(2, $"CustomerId")
    df.show()
    val dfOut2 = df.mapPartitions(p => {
      println(p)
      val trnMap = scala.collection.mutable.Map[Int, SalesTransaction]()
      val trnSales = scala.collection.mutable.ArrayBuffer.empty[TransactionScore]
      val trnCredits = scala.collection.mutable.ArrayBuffer.empty[TransactionScore]
      val trnPairs = scala.collection.mutable.ArrayBuffer.empty[TransactionPair]
      // Load the partition into memory and split it into sales and credits.
      p.foreach(row => {
        val trnKey: Int = row.getAs[Int]("TransactionId")
        val trnValue: SalesTransaction = new SalesTransaction(row.getAs("CustomerId")
          , trnKey
          , row.getAs("TransactionAttributesScore")
          , row.getAs("Revenue")
          , row.getAs("TransactionType")
        )
        trnMap += (trnKey -> trnValue)
        if (trnValue.Type == "Sales") {
          trnSales += new TransactionScore(trnKey, trnValue.Score)
        } else {
          trnCredits += new TransactionScore(trnKey, trnValue.Score)
        }
      })
      if (trnCredits.size > 0 && trnSales.size > 0) {
        // Define every candidate (sale, credit) pair with its score difference.
        trnCredits.foreach(cr => {
          trnSales.foreach(sl => {
            trnPairs += new TransactionPair(sl.TransactionId, cr.TransactionId, math.abs(cr.Score - sl.Score))
          })
        })
      }
      // Walk the pairs from best to worst match and link each transaction at most once.
      trnPairs.sortBy(t => t.ScoreDiff)
        .foreach(t => {
          val sale = trnMap(t.SalesId)
          val credit = trnMap(t.CreditId)
          if (!credit.IsProcessed && !sale.IsProcessed) {
            val credited = math.min(credit.Revenue, sale.Revenue)
            trnMap(t.SalesId) = sale.copy(Credited = credited, LinkedTransactionId = t.CreditId, IsProcessed = true)
            trnMap(t.CreditId) = credit.copy(Credited = credited, LinkedTransactionId = t.SalesId, IsProcessed = true)
          }
        })
      trnMap.values.iterator
    })
    dfOut2.show()
    spark.stop()
  }
}
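The matching logic itself can be tried without a cluster. The following is a minimal plain-Python sketch of the same greedy idea, under a few assumptions that are not in the original: rows are grouped by customer explicitly (rather than relying on partitioning), ties in score difference are broken by transaction id, and link_credits is a hypothetical helper name. The rows are the sample data from the example above.

```python
from collections import defaultdict

# (customer_id, transaction_id, score, type, revenue)
ROWS = [
    (1, 1, 123, "Sales", 100.0),
    (1, 2, 122, "Credit", 100.0),
    (1, 3, 99, "Sales", 70.0),
    (1, 4, 101, "Sales", 77.0),
    (1, 5, 102, "Credit", 75.0),
    (1, 6, 98, "Sales", 71.0),
    (2, 7, 200, "Sales", 55.0),
    (2, 8, 220, "Sales", 55.0),
    (2, 9, 200, "Credit", 50.0),
    (2, 10, 205, "Sales", 50.0),
]

def link_credits(rows):
    """Greedy matching; returns {transaction_id: linked_transaction_id}."""
    by_customer = defaultdict(list)
    for row in rows:
        by_customer[row[0]].append(row)

    links = {}
    for customer_rows in by_customer.values():
        sales = [r for r in customer_rows if r[3] == "Sales"]
        credits = [r for r in customer_rows if r[3] == "Credit"]
        # Every (credit, sale) candidate pair with its score distance.
        pairs = [(abs(c[2] - s[2]), c[1], s[1]) for c in credits for s in sales]
        # Best matches first; a transaction is linked at most once.
        for _, credit_id, sale_id in sorted(pairs):
            if credit_id not in links and sale_id not in links:
                links[credit_id] = sale_id
                links[sale_id] = credit_id
    return links

links = link_credits(ROWS)
print(sorted(links.items()))
# [(1, 2), (2, 1), (4, 5), (5, 4), (7, 9), (9, 7)]
```

Transactions 3, 6, 8 and 10 stay unlinked, which corresponds to LinkedTransactionId keeping its default value in the Scala version.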

PySpark - Add a new nested column or change the value of existing nested columns

Suppose I have a JSON file whose lines have the following structure:
{
"a": 1,
"b": {
"bb1": 1,
"bb2": 2
}
}
I want to change the value of key bb1 or add a new key, such as bb3.
Currently, I use spark.read.json to load the JSON file into Spark as a DataFrame and df.rdd.map to map each Row of the RDD to a dict. I then change a nested value or add a nested key, convert the dict back to a Row, and finally convert the RDD to a DataFrame.
The workflow is as follows:
def map_func(row):
    dictionary = row.asDict(True)
    # add a new key or change a key's value here
    return as_row(dictionary)  # as_row converts a dict to a Row recursively
df = spark.read.json("json_file")
df.rdd.map(map_func).toDF().write.json("new_json_file")
This works, but I am concerned that converting DataFrame -> RDD (Row -> dict -> Row) -> DataFrame will hurt performance.
Is there another way to do this without that cost?
The final solution I used relies on withColumn and builds the schema of b dynamically.
First, we get b_schema from the DataFrame schema:
b_schema = next(field['type'] for field in df.schema.jsonValue()['fields'] if field['name'] == 'b')
b_schema is then a dict, so we can append a new field to it:
b_schema['fields'].append({"metadata":{},"type":"string","name":"bb3","nullable":True})
Then we convert it to a StructType:
from pyspark.sql.types import StructType
new_b = StructType.fromJson(b_schema)
In map_func, we convert the Row to a dict and populate the new field:
from pyspark.sql.functions import udf
def map_func(row):
    data = row.asDict(True)
    data['bb3'] = data['bb1'] + data['bb2']
    return data
map_udf = udf(map_func, new_b)
df.withColumn('b', map_udf('b')).collect()
Thanks @Mariusz
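To make the schema manipulation easier to follow, here is a small plain-Python sketch of it. It does not call PySpark: schema_json is a hand-written stand-in for what df.schema.jsonValue() returns for the sample file, with_nested_field is a hypothetical helper, and declaring bb3 as long (because it holds a sum) is an assumption of this sketch.

```python
import copy

# Hand-written stand-in for df.schema.jsonValue() on the sample file.
schema_json = {
    "type": "struct",
    "fields": [
        {"name": "a", "type": "long", "nullable": True, "metadata": {}},
        {"name": "b", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [
                {"name": "bb1", "type": "long", "nullable": True, "metadata": {}},
                {"name": "bb2", "type": "long", "nullable": True, "metadata": {}},
            ],
        }},
    ],
}

def with_nested_field(schema, parent, name, type_name):
    """Return a copy of the parent's struct schema with one extra field."""
    parent_type = next(f["type"] for f in schema["fields"] if f["name"] == parent)
    extended = copy.deepcopy(parent_type)  # leave the original schema untouched
    extended["fields"].append(
        {"name": name, "type": type_name, "nullable": True, "metadata": {}}
    )
    return extended

b_schema = with_nested_field(schema_json, "b", "bb3", "long")
print([f["name"] for f in b_schema["fields"]])  # ['bb1', 'bb2', 'bb3']

def map_func(data):
    """Plain-dict version of the per-row logic: add bb3 without mutating the input."""
    out = dict(data)
    out["bb3"] = out["bb1"] + out["bb2"]
    return out

print(map_func({"bb1": 1, "bb2": 2}))  # {'bb1': 1, 'bb2': 2, 'bb3': 3}
```

In PySpark, the extended dict is what would be passed to StructType.fromJson, as in the answer above.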
You can use map_func as a UDF and so avoid the DF -> RDD -> DF conversion, while keeping the flexibility of Python for the business logic. All you need is a schema object:
>>> from pyspark.sql.types import *
>>> new_b = StructType([StructField('bb1', LongType()), StructField('bb2', LongType()), StructField('bb3', LongType())])
Then you define map_func and udf:
>>> from pyspark.sql.functions import *
>>> def map_func(data):
...     return {'bb1': 4, 'bb2': 5, 'bb3': 6}
...
>>> map_udf = udf(map_func, new_b)
Finally apply this UDF to dataframe:
>>> df = spark.read.json('sample.json')
>>> df.withColumn('b', map_udf('b')).first()
Row(a=1, b=Row(bb1=4, bb2=5, bb3=6))
EDIT:
Following up on the comment: you can add a field to an existing StructType in an easier way, for example:
>>> df = spark.read.json('sample.json')
>>> new_b = df.schema['b'].dataType.add(StructField('bb3', LongType()))

Checking if a value in an RDD(K,V) is contained in a value of another RDD(K,V)

I have two RDD(K,V)s. Spark does not allow nesting one RDD operation (such as map) inside another.
val x = sc.parallelize(List((1,"abc"),(2,"efg")))
val y = sc.parallelize(List((1,"ab"),(2,"ef"), (3,"tag")))
I want to check whether "abc" contains "ab", in a way that works whether or not the RDDs are large.
Assuming you want to select a value from RDD x when any value in RDD y is a substring of it, this code should work.
import org.apache.spark.rdd.RDD
// spark is assumed to be a SparkSession that is in scope.
def main(args: Array[String]): Unit = {
  val x = spark.sparkContext.parallelize(List((1, "abc"), (2, "efg")))
  val y = spark.sparkContext.parallelize(List((1, "ab"), (2, "ef"), (3, "tag")))
  // Keep an element of x only if some value in y is a substring of its value.
  val filteredRDD = filterRDD(x, y)
  // Map the filtered RDD to the result array.
  val resultArray = filteredRDD.map(x => x._2).collect()
}
def filterRDD(x: RDD[(Int, String)], y: RDD[(Int, String)]): RDD[(Int, String)] = {
  // Collect y on the driver up front and broadcast it to all Spark nodes,
  // so that the filter below never has to call collect (or touch another RDD).
  val y_bc = spark.sparkContext.broadcast(y.collect.toSet)
  x.filter(m => {
    y_bc.value.exists(n => m._2.contains(n._2))
  })
}
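The filter above ignores the keys: a value of x is kept if any value of y occurs in it. The question can also be read as a per-key check ("abc" under key 1 against "ab" under key 1). The plain-Python sketch below contrasts the two readings; both function names are hypothetical, the lists stand in for the RDDs, and the per-key variant is an alternative interpretation rather than what the answer implements.

```python
x = [(1, "abc"), (2, "efg")]
y = [(1, "ab"), (2, "ef"), (3, "tag")]

def contains_any(x_pairs, y_pairs):
    """Keep pairs of x whose value contains ANY value of y (what the answer's filter does)."""
    needles = {v for _, v in y_pairs}  # stands in for the broadcast set
    return [(k, v) for k, v in x_pairs if any(n in v for n in needles)]

def contains_same_key(x_pairs, y_pairs):
    """Keep pairs of x whose value contains the y value stored under the SAME key."""
    y_by_key = dict(y_pairs)  # assumes keys in y are unique
    return [(k, v) for k, v in x_pairs if k in y_by_key and y_by_key[k] in v]

print(contains_any(x, y))       # [(1, 'abc'), (2, 'efg')]
print(contains_same_key(x, y))  # [(1, 'abc'), (2, 'efg')]
# The two readings differ once the keys stop lining up:
print(contains_same_key([(1, "efg")], y))  # []
print(contains_any([(1, "efg")], y))       # [(1, 'efg')]
```

In Spark, the per-key reading would typically be expressed as a join on the key followed by a filter, which avoids collecting y on the driver when y is large.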
