Can I use UDAFs in window functions?

I created a user-defined aggregate function. It concatenates all accumulated values into a list (ArrayType). It's called EdgeHistory.
If I don't specify the window it works fine: it returns an array of all the lists. But with the following example it fails:
case class ExampleRow(n: Int, list: List[(String, String, Float, Float)])
val x = Seq(
ExampleRow(1, List(("a", "b", 1f, 2f), ("c", "d", 2f, 3f))),
ExampleRow(2, List(("a", "b", 2f, 4f), ("c", "d", 4f, 6f))),
ExampleRow(3, List(("a", "b", 4f, 8f), ("c", "d", 8f, 12f)))
)
val df = sc.parallelize(x).toDF()
val edgeHistory = new EdgeHistory()
val y = df.agg(edgeHistory('list).over(Window.orderBy("n").rangeBetween(1, 0)))
It throws an error:
STDERR: Exception in thread "main" java.lang.UnsupportedOperationException: EdgeHistory('list) is not supported in a window operation.
at org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:177)
at org.apache.spark.sql.Column.over(Column.scala:1052)
at szdavid92.AnalyzeGraphStream$.main(AnalyzeGraphStream.scala:75)
The error message seems pretty straightforward: apparently you cannot use UDAFs in window operations.
Do I understand that correctly?
Why does this limitation exist?
UPDATE
I tried the SQL syntax and I get a related error:
df.registerTempTable("data")
sqlContext.udf.register("edge_history", edgeHistory)
val y = sqlContext.sql(
"""
|SELECT n, list, edge_history(list) OVER (ORDER BY n ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
|FROM data
""".stripMargin)
that is
Exception in thread "main" org.apache.spark.sql.AnalysisException: Couldn't find window function edge_history;
at org.apache.spark.sql.hive.ResolveHiveWindowFunction$$anonfun$apply$1$$anonfun$applyOrElse$1$$anonfun$3.apply(hiveUDFs.scala:288)
at org.apache.spark.sql.hive.ResolveHiveWindowFunction$$anonfun$apply$1$$anonfun$applyOrElse$1$$anonfun$3.apply(hiveUDFs.scala:288)
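For reference, built-in aggregates such as collect_list are accepted over a window in the Spark 2.x DataFrame API, so a rough sketch of a workaround (assuming EdgeHistory essentially collects the list values inside the frame, which may not match its real semantics) could look like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list}
// Same df as above; collect_list is a built-in aggregate, so Spark accepts it
// in a window specification ("1 preceding to current row", as in the SQL attempt).
val w = Window.orderBy("n").rowsBetween(-1, 0)
val y = df.withColumn("edge_history", collect_list(col("list")).over(w))
y.show(truncate = false)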

Related

Combine associated items in Spark

In Spark, I have a large list (millions) of elements that contain items associated with each other. Examples:
1: ("A", "C", "D") # Each of the items in this array is associated with any other element in the array, so A and C are associated, A, and D are associated, and C and D are associated.
2: ("F", "H", "I", "P")
3: ("H", "I", "D")
4: ("X", "Y", "Z")
I want to perform an operation that combines the associations wherever they span multiple lists. In the example above, we can see that all the items of the first three lines are associated with each other (line 1 and line 2 should be combined because, according to line 3, D and I are associated). Therefore, the output should be:
("A", "C", "D", "F", "H", "I", "P")
("X", "Y", "Z")
What type of transformations in Spark can I use to perform this operation? I looked at various ways of grouping the data, but haven't found an obvious way to combine list elements when they share common elements.
Thank you!
As a couple of users have already stated, this can be seen as a graph problem where you want to find the connected components of a graph.
As you are using Spark, I think this is a nice opportunity to show how to use GraphFrames in Python.
To run this example you will need the pyspark and graphframes Python packages.
from pyspark.sql import SparkSession
from graphframes import GraphFrame
from pyspark.sql import functions as f
spark = (
SparkSession.builder.appName("test")
.config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12")
.getOrCreate()
)
# graphframe requires defining a checkpoint dir.
spark.sparkContext.setCheckpointDir("/tmp/checkpoint")
# let's create a sample dataframe
df = spark.createDataFrame(
[
(1, ["A", "C", "D"]),
(2, ["F", "H", "I", "P"]),
(3, ["H", "I", "D"]),
(4, ["X", "Y", "Z"]),
],
["id", "values"],
)
# We can use the explode function to explode the lists into new rows, giving (id, node) pairs
df = df.withColumn("node", f.explode("values"))
df.createOrReplaceTempView("temp_table")
# Then we can join the table with itself to generate an edge table with source and destination nodes.
edge_table = spark.sql(
"""
SELECT
distinct a.node as src, b.node as dst
FROM
temp_table a join temp_table b
ON a.id=b.id AND a.node != b.node
"""
)
# Now we define our graph using a vertex table (a table with the node ids)
# and our edge table,
# then we use the connectedComponents method to find the components
cc_df = GraphFrame(
df.selectExpr("node as id").drop_duplicates(), edge_table
).connectedComponents()
# The cc_df dataframe will have two columns, the node id and the connected component.
# To get the desired result we can group by the component and create a list
cc_df.groupBy("component").agg(f.collect_list("id")).show(truncate=False)
The output groups the nodes by connected component, giving the two merged lists from the question (the component IDs assigned by GraphFrames are arbitrary numbers).
You can install the dependencies by using:
pip install -q pyspark==3.2 graphframes
There probably isn't enough information in the question to fully solve this, but I would suggest creating an adjacency matrix/list using GraphX to represent it as a graph. From there, hopefully you can solve the rest of your problem (a rough sketch follows the links below).
https://en.wikipedia.org/wiki/Adjacency_matrix
https://spark.apache.org/docs/latest/graphx-programming-guide.html
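For what it's worth, a rough GraphX sketch along those lines, where the hashCode-based vertex IDs and the groups RDD are illustrative assumptions rather than anything from the question:
import org.apache.spark.graphx.Graph
// Hypothetical input: each element is a list of mutually associated items.
val groups = sc.parallelize(Seq(
  Seq("A", "C", "D"),
  Seq("F", "H", "I", "P"),
  Seq("H", "I", "D"),
  Seq("X", "Y", "Z")))
// GraphX vertex IDs are Longs; hashCode is used here only to keep the sketch
// short, a collision-free ID assignment is safer in practice.
def id(s: String): Long = s.hashCode.toLong
// Linking consecutive items of each list is enough to connect its component
// (single-item lists would need separate handling).
val edges = groups.flatMap(items =>
  items.sliding(2).collect { case Seq(a, b) => (id(a), id(b)) })
val names = groups.flatMap(items => items.map(s => (id(s), s)))
val graph = Graph.fromEdgeTuples(edges, defaultValue = 1)
// connectedComponents labels each vertex with the lowest vertex ID in its
// component; grouping by that label reassembles the merged lists.
val merged = graph.connectedComponents().vertices
  .join(names)
  .map { case (_, (component, name)) => (component, name) }
  .groupByKey()
  .mapValues(_.toSet)
merged.values.collect().foreach(println)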
If you are using a PySpark Kernel, this solution should work
iset = set([frozenset(s) for s in tuple_list])  # Convert to a set of sets
result = []
while iset:                           # While there are sets left to process:
    nset = set(iset.pop())            # Pop a new set
    check = len(iset)                 # Does iset contain more sets
    while check:                      # Until no more sets to check:
        check = False
        for s in iset.copy():         # For each other set:
            if nset.intersection(s):  # if they intersect:
                check = True          # Must recheck previous sets
                iset.remove(s)        # Remove it from remaining sets
                nset.update(s)        # Add it to the current set
    result.append(tuple(nset))        # Convert back to a list of tuples

PySpark - convert RDD to pair key value RDD

I created an RDD from a CSV:
lines = sc.textFile(data)
Now I need to convert lines to a key-value RDD, where the value will be the string (after splitting) and the key will be the number of the CSV column.
For example, given the CSV:
Col1,Col2
73,230666
55,149610
I want to get rdd.take(1):
[(1,73), (2, 230666)]
I create an RDD of lists:
lines_of_list = lines_data.map(lambda line : line.split(','))
I create a function that takes a list and returns a list of (key, value) tuples:
def list_of_tuple(l):
    list_tup = []
    for i in range(len(l[0])):
        list_tup.append((l[0][i], i))
    return list_tup
But I can’t get the correct result when I try to map this function on RDD
You can use PySpark's create_map function to do so, like this:
from pyspark.sql.functions import create_map, col, lit
df = spark.createDataFrame([(73, 230666), (55, 149610)], "Col1: int, Col2: int")
mapped_df = df.select(create_map(lit(1), col("Col1")).alias("mappedCol1"), create_map(lit(2), col("Col2")).alias("mappedCol2"))
mapped_df.show()
+----------+-------------+
|mappedCol1| mappedCol2|
+----------+-------------+
| {1 -> 73}|{2 -> 230666}|
| {1 -> 55}|{2 -> 149610}|
+----------+-------------+
If you still want to use the RDD API, it is available as a property of the DataFrame, so you can use it like so:
mapped_df.rdd.take(1)
Out[32]: [Row(mappedCol1={1: 73}, mappedCol2={2: 230666})]
I fixed the problem in this way:
def list_of_tuple(line_rdd):
    l = line_rdd.split(',')
    list_tup = []
    for i in range(len(l)):
        list_tup.append((l[i], i))
    return list_tup
pairs_rdd = lines_data.map(lambda line: list_of_tuple(line))

Apache Spark - How to use groupBy groupByKey to form a (Key, List) pair

I have an org.apache.spark.sql.DataFrame = [id: bigint, name: string] in hand, and the sample data in it looks like:
(1, "City1")
(2, "City3")
(1, "CityX")
(4, "CityZ")
(2, "CityN")
I am trying to form an output like:
(1, ("City1", "CityX"))
(2, ("City3", "CityN"))
(4, ("CityZ"))
I tried the following variants
df.groupByKey.mapValues(_.toList).show(20, false)
df.groupBy("id").show(20, false)
df.rdd.groupByKey.mapValues(_.toList).show(20, false)
df.rdd.groupBy("id").show(20, false)
All of them fail, complaining either that groupBy/groupByKey is ambiguous or that the method was not found. Any help is appreciated.
I tried the solution posted in Spark Group By Key to (Key,List) Pair; however, that doesn't work for me and it fails with the following error:
<console>:88: error: overloaded method value groupByKey with alternatives:
[K](func: org.apache.spark.api.java.function.MapFunction[org.apache.spark.sql.Row,K], encoder: org.apache.spark.sql.Encoder[K])org.apache.spark.sql.KeyValueGroupedDataset[K,org.apache.spark.sql.Row] <and>
[K](func: org.apache.spark.sql.Row => K)(implicit evidence$3: org.apache.spark.sql.Encoder[K])org.apache.spark.sql.KeyValueGroupedDataset[K,org.apache.spark.sql.Row]
cannot be applied to ()
Thanks.
Edit:
I did try the following:
val result = df.groupBy("id").agg(collect_list("name"))
which gives
org.apache.spark.sql.DataFrame = [id: bigint, collect_list(node): array<string>]
I am not sure how to use this collect_list type. I am trying to dump this to a file by doing:
result.rdd.coalesce(1).saveAsTextFile("test")
and I see the following
(1, WrappedArray(City1, CityX))
(2, WrappedArray(City3, CityN))
(4, WrappedArray(CityZ))
How do I dump this as the following?
(1, (City1, CityX))
(2, (City3, CityN))
(4, (CityZ))
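For what it's worth, a rough sketch of one way to get that exact text format, assuming result is the DataFrame with the collect_list column from above, is to format each row before saving:
result.rdd
  .map { row =>
    val id = row.getLong(0)
    val names = row.getSeq[String](1)
    s"($id, (${names.mkString(", ")}))"  // e.g. (1, (City1, CityX))
  }
  .coalesce(1)
  .saveAsTextFile("test")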
If you have an RDD of pairs, then you can use combineByKey(). To do this you have to pass 3 methods as arguments:
Method 1 takes a String, for example 'City1', as input, adds that String to an empty List, and returns that list.
Method 2 takes a String, for example 'CityX', and one of the lists created by the previous method, adds the String to the list, and returns the list.
Method 3 takes 2 lists as input and returns a new list with all the values from the 2 argument lists.
combineByKey will then return an RDD of (key, List) pairs.
However, in your case you are starting off with a DataFrame, which I do not have much experience with. I imagine that you will need to convert it to an RDD in order to use combineByKey().
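A minimal sketch of that combineByKey approach, assuming the DataFrame has first been turned into a plain pair RDD (the variable names below are illustrative):
import org.apache.spark.rdd.RDD
// Hypothetical pair RDD holding the sample data from the question.
val pairs: RDD[(Long, String)] = sc.parallelize(Seq(
  (1L, "City1"), (2L, "City3"), (1L, "CityX"), (4L, "CityZ"), (2L, "CityN")))
val grouped: RDD[(Long, List[String])] = pairs.combineByKey(
  (city: String) => List(city),                                // method 1: start a list for a new key
  (acc: List[String], city: String) => city :: acc,            // method 2: add a value to an existing list
  (left: List[String], right: List[String]) => left ::: right  // method 3: merge lists built on different partitions
)
grouped.collect().foreach(println)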

Spark SQL refer to columns programmatically

I am about to develop a function which uses Spark SQL to perform an operation per column. In this function I need to refer to the column's name:
val input = Seq(
(0, "A", "B", "C", "D"),
(1, "A", "B", "C", "D"),
(0, "d", "a", "jkl", "d"),
(0, "d", "g", "C", "D"),
(1, "A", "d", "t", "k"),
(1, "d", "c", "C", "D"),
(1, "c", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
The following example, explicitly referring to columns via 'column, works fine.
val pre1_1 = input.groupBy('col1).agg(mean($"TARGET").alias("pre_col1"))
val pre2_1 = input.groupBy('col1, 'TARGET).agg(count("*") / input.filter('TARGET === 1).count alias ("pre2_col1"))
input.as('a)
.join(pre1_1.as('b), $"a.col1" === $"b.col1").drop($"b.col1")
.join(pre2_1.as('b), ($"a.col1" === $"b.col1") and ($"a.TARGET" === $"b.TARGET")).drop($"b.col1").drop($"b.TARGET").show
When referring to the columns programmatically, they can no longer be resolved once two joins are performed one after the other, which worked fine in the code snippet above.
I could observe that for this code snippet the first and initial col1 of the df was moved from the beginning to the end; probably this is the reason it can no longer be resolved.
But so far I could not figure out how to access the column when only passing a string, i.e. how to properly reference the column names in a function.
val pre1_1 = input.groupBy("col1").agg(mean('TARGET).alias("pre_" + "col1"))
val pre2_1 = input.groupBy("col1", "TARGET").agg(count("*") / input.filter('TARGET === 1).count alias ("pre2_" + "col1"))
input.join(pre1_1, input("col1") === pre1_1("col1")).drop(pre1_1("col1"))
.join(pre2_1, (input("col1") === pre2_1("col1")) and (input("TARGET") === pre2_1("TARGET"))).drop(pre2_1("col1")).drop(pre2_1("TARGET"))
as well as an alternative approach like:
df.as('a)
.join(pre1_1.as('b), $"a.${col}" === $"b.${col}").drop($"b.${col}")
did not succeed, as $"a.${col}" was no longer resolved to the column of alias a but rather to df("a.col1"), which does not exist.
In complex cases always use unique aliases to reference columns with shared lineage. This is the only way to ensure correct and stable behavior.
import org.apache.spark.sql.functions.col
val pre1_1 = input.groupBy("col1").agg(mean('TARGET).alias("pre_" + "col1")).alias("pre1_1")
val pre2_1 = input.groupBy("col1", "TARGET").agg(count("*") / input.filter('TARGET === 1).count alias ("pre2_" + "col1")).alias("pre2_1")
input.alias("input")
.join(pre1_1, col("input.col1") === col("pre1_1.col1"))
.join(pre2_1, (col("input.col1") === col("pre2_1.col1")) and (col("input.TARGET") === col("pre2_1.TARGET")))
If you check the logs you will actually see warnings like:
WARN Column: Constructing trivially true equals predicate, 'col1#12 = col1#12'. Perhaps you need to use aliases
and the code you use works only because there are "special cases" in the Spark source.
In simple cases like this, just use the equi-join syntax:
input.join(pre1_1, Seq("col1"))
.join(pre2_1, Seq("col1", "TARGET"))
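Building on that, a rough sketch of wrapping the equi-join version in a function that takes the column name as a plain String (the helper name withTargetMeans is made up for illustration):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, mean}
// Hypothetical helper: the column to work on is passed in as a plain String.
def withTargetMeans(df: DataFrame, c: String): DataFrame = {
  val pre1 = df.groupBy(c).agg(mean(col("TARGET")).alias(s"pre_$c"))
  val pre2 = df.groupBy(c, "TARGET")
    .agg((count("*") / df.filter(col("TARGET") === 1).count).alias(s"pre2_$c"))
  // Equi-joins on plain column names avoid the ambiguous-lineage problem.
  df.join(pre1, Seq(c)).join(pre2, Seq(c, "TARGET"))
}
withTargetMeans(input, "col1").show()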

Spark Dataframe groupBy and sort results into a list

I have a Spark Dataframe and I would like to group the elements by a key and have the results as a sorted list
Currently I am using:
df.groupBy("columnA").agg(collect_list("columnB"))
How do I make the items in the list sorted in ascending order?
You could try the function sort_array available in the functions package:
import org.apache.spark.sql.functions._
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
Just wanted to add another hint to Daniel de Paula's answer regarding the sort_array solution.
If you want to sort elements according to a different column, you can form a struct of two fields:
the sort by field
the result field
Since structs are sorted field by field, you'll get the order you want; all you then need to do is get rid of the sort-by column in each element of the resulting list.
The same approach can be applied with several sort by columns when needed.
Here's an example that can be run in local spark-shell (use :paste mode):
import org.apache.spark.sql.Row
import spark.implicits._
case class Employee(name: String, department: String, salary: Double)
val employees = Seq(
Employee("JSMITH", "A", 20.0),
Employee("AJOHNSON", "A", 650.0),
Employee("CBAKER", "A", 650.2),
Employee("TGREEN", "A", 13.0),
Employee("CHORTON", "B", 111.0),
Employee("AIVANOV", "B", 233.0),
Employee("VSMIRNOV", "B", 11.0)
)
val employeesDF = spark.createDataFrame(employees)
val getNames = udf { salaryNames: Seq[Row] =>
salaryNames.map { case Row(_: Double, name: String) => name }
}
employeesDF
.groupBy($"department")
.agg(collect_list(struct($"salary", $"name")).as("salaryNames"))
.withColumn("namesSortedBySalary", getNames(sort_array($"salaryNames", asc = false)))
.show(truncate = false)
The result:
+----------+--------------------------------------------------------------------+----------------------------------+
|department|salaryNames |namesSortedBySalary |
+----------+--------------------------------------------------------------------+----------------------------------+
|B |[[111.0, CHORTON], [233.0, AIVANOV], [11.0, VSMIRNOV]] |[AIVANOV, CHORTON, VSMIRNOV] |
|A |[[20.0, JSMITH], [650.0, AJOHNSON], [650.2, CBAKER], [13.0, TGREEN]]|[CBAKER, AJOHNSON, JSMITH, TGREEN]|
+----------+--------------------------------------------------------------------+----------------------------------+
