Spark SQL refer to columns programmatically - apache-spark

I am about to develop a function which uses Spark SQL to perform an operation per column. In this function I need to refer to the columns by name:
val input = Seq(
(0, "A", "B", "C", "D"),
(1, "A", "B", "C", "D"),
(0, "d", "a", "jkl", "d"),
(0, "d", "g", "C", "D"),
(1, "A", "d", "t", "k"),
(1, "d", "c", "C", "D"),
(1, "c", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
The following example, which explicitly refers to columns via the 'column symbol syntax, works fine.
val pre1_1 = input.groupBy('col1).agg(mean($"TARGET").alias("pre_col1"))
val pre2_1 = input.groupBy('col1, 'TARGET).agg(count("*") / input.filter('TARGET === 1).count alias ("pre2_col1"))
input.as('a)
.join(pre1_1.as('b), $"a.col1" === $"b.col1").drop($"b.col1")
.join(pre2_1.as('b), ($"a.col1" === $"b.col1") and ($"a.TARGET" === $"b.TARGET")).drop($"b.col1").drop($"b.TARGET").show
When referring to the columns programmatically, however, they can no longer be resolved once the two joins are performed one after the other, even though that worked fine in the code snippet above.
I could observe that for this code snippet the initial col1 of the DataFrame was moved from the beginning to the end; probably this is the reason it can no longer be resolved.
But so far I could not figure out how to access a column when only passing a string, i.e. how to properly reference the column names inside a function.
val pre1_1 = input.groupBy("col1").agg(mean('TARGET).alias("pre_" + "col1"))
val pre2_1 = input.groupBy("col1", "TARGET").agg(count("*") / input.filter('TARGET === 1).count alias ("pre2_" + "col1"))
input.join(pre1_1, input("col1") === pre1_1("col1")).drop(pre1_1("col1"))
.join(pre2_1, (input("col1") === pre2_1("col1")) and (input("TARGET") === pre2_1("TARGET"))).drop(pre2_1("col1")).drop(pre2_1("TARGET"))
as well as an alternative approach like:
df.as('a)
.join(pre1_1.as('b), $"a.${col}" === $"b.${col}").drop($"b.${col}")
did not succeed, as $"a.${col}" was no longer resolved to the column of alias a but rather to df("a.col1"), which does not exist.

In complex cases always use unique aliases to reference columns with shared lineage. This is the only way to ensure correct and stable behavior.
import org.apache.spark.sql.functions.col
val pre1_1 = input.groupBy("col1").agg(mean('TARGET).alias("pre_" + "col1")).alias("pre1_1")
val pre2_1 = input.groupBy("col1", "TARGET").agg(count("*") / input.filter('TARGET === 1).count alias ("pre2_" + "col1")).alias("pre2_1")
input.alias("input")
.join(pre1_1, col("input.col1") === col("pre1_1.col1"))
.join(pre2_1, (col("input.col1") === col("pre2_1.col1")) and (col("input.TARGET") === col("pre2_1.TARGET")))
If you check the logs you will actually see warnings like:
WARN Column: Constructing trivially true equals predicate, 'col1#12 = col1#12'. Perhaps you need to use aliases
and the code you use works only because of "special cases" in the Spark source.
In a simple case like this, just use the equi-join syntax:
input.join(pre1_1, Seq("col1"))
.join(pre2_1, Seq("col1", "TARGET"))
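To address the original goal of doing this per column inside a function, here is a minimal sketch of how the equi-join variant could be parameterized by a column name; the helper name addTargetMeans and the fold over the feature columns are my own illustration, not part of the original answer:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, mean}

// Hypothetical helper: adds "pre_<col>" and "pre2_<col>" columns for one column name.
// Grouping and joining on plain column-name strings (equi-join syntax) sidesteps
// the ambiguous-reference problem entirely.
def addTargetMeans(df: DataFrame, c: String): DataFrame = {
  val positives = df.filter(col("TARGET") === 1).count
  val pre1 = df.groupBy(c).agg(mean(col("TARGET")).alias(s"pre_$c"))
  val pre2 = df.groupBy(c, "TARGET").agg((count("*") / positives).alias(s"pre2_$c"))
  df.join(pre1, Seq(c))
    .join(pre2, Seq(c, "TARGET"))
}

// apply it to every feature column of the example input
val result = Seq("col1", "col2", "col3TooMany", "col4")
  .foldLeft(input)(addTargetMeans)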

Related

Combine associated items in Spark

In Spark, I have a large list (millions) of elements that contain items associated with each other. Examples:
1: ("A", "C", "D") # Each of the items in this array is associated with any other element in the array, so A and C are associated, A, and D are associated, and C and D are associated.
2: ("F", "H", "I", "P")
3: ("H", "I", "D")
4: ("X", "Y", "Z")
I want to perform an operation to combine the associations where the associations go across the lists. In the example above, we can see that all the items of the first three lines are associated with each other (line 1 and line 2 should be combined because, according to line 3, D and I are associated). Therefore, the output should be:
("A", "C", "D", "F", "H", "I", "P")
("X", "Y", "Z")
What type of transformations in Spark can I use to perform this operation? I looked at various ways of grouping the data, but haven't found an obvious way to combine list elements if they share common elements.
Thank you!
As a couple of users have already stated, this can be seen as a graph problem, where you want to find the connected components in a graph.
As you are using Spark, I think this is a nice opportunity to show how to use GraphFrames in Python.
To run this example you will need the pyspark and graphframes Python packages.
from pyspark.sql import SparkSession
from graphframes import GraphFrame
from pyspark.sql import functions as f

spark = (
    SparkSession.builder.appName("test")
    .config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12")
    .getOrCreate()
)

# graphframes requires defining a checkpoint dir.
spark.sparkContext.setCheckpointDir("/tmp/checkpoint")

# let's create a sample dataframe
df = spark.createDataFrame(
    [
        (1, ["A", "C", "D"]),
        (2, ["F", "H", "I", "P"]),
        (3, ["H", "I", "D"]),
        (4, ["X", "Y", "Z"]),
    ],
    ["id", "values"],
)

# We can use the explode function to explode the lists into new rows, giving us (id, node) pairs.
df = df.withColumn("node", f.explode("values"))
df.createOrReplaceTempView("temp_table")

# Then we can join the table with itself to generate an edge table with source and destination nodes.
edge_table = spark.sql(
    """
    SELECT
        distinct a.node as src, b.node as dst
    FROM
        temp_table a join temp_table b
        ON a.id = b.id AND a.node != b.node
    """
)

# Now we define our graph by using a vertex table (a table with the node ids)
# and our edge table,
# then we use the connectedComponents method to find the components.
cc_df = GraphFrame(
    df.selectExpr("node as id").drop_duplicates(), edge_table
).connectedComponents()

# The cc_df dataframe will have two columns, the node id and the connected component.
# To get the desired result we can group by the component and create a list.
cc_df.groupBy("component").agg(f.collect_list("id")).show(truncate=False)
The output you will get groups the nodes into two connected components: one containing A, C, D, F, H, I and P, and another containing X, Y and Z (the numeric component ids themselves are arbitrary).
You can install the dependencies by using:
pip install -q pyspark==3.2 graphframes
There probably isn't enough information in the question to fully solve this, but I would suggest creating an adjacency matrix/list using GraphX to represent the data as a graph (a rough sketch follows the links below). From there, hopefully you can solve the rest of your problem.
https://en.wikipedia.org/wiki/Adjacency_matrix
https://spark.apache.org/docs/latest/graphx-programming-guide.html
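For what it's worth, here is a rough Scala sketch of that GraphX route, under the assumption that the lists are available as an RDD on an existing SparkContext sc; the vid helper, which hashes item names to vertex ids, is purely illustrative:
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// The sample lists from the question, as an RDD.
val lists: RDD[Seq[String]] = sc.parallelize(Seq(
  Seq("A", "C", "D"),
  Seq("F", "H", "I", "P"),
  Seq("H", "I", "D"),
  Seq("X", "Y", "Z")
))

// Illustrative id scheme: hash item names to Long vertex ids
// (a real job would want a collision-free mapping, e.g. via zipWithUniqueId).
def vid(s: String): Long = s.hashCode.toLong

val vertices: RDD[(Long, String)] = lists.flatMap(_.map(item => (vid(item), item)))

// Each list links its first item to every other item; transitivity is handled
// by connectedComponents, so one star per list is enough.
val edges: RDD[Edge[Int]] = lists.flatMap { items =>
  items.tail.map(other => Edge(vid(items.head), vid(other), 1))
}

val components = Graph(vertices, edges).connectedComponents().vertices

// Join the item names back in and group them by component id.
components
  .join(vertices)
  .map { case (_, (componentId, item)) => (componentId, item) }
  .groupByKey()
  .map { case (_, items) => items.toSet }
  .collect()
  .foreach(println)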
If you are using a PySpark kernel, this plain-Python solution should work (it assumes the lists have been collected to the driver as tuple_list):
iset = set([frozenset(s) for s in tuple_list])  # Convert to a set of sets
result = []
while iset:                           # While there are sets left to process:
    nset = set(iset.pop())            # Pop a new set
    check = len(iset)                 # Does iset contain more sets
    while check:                      # Until no more sets to check:
        check = False
        for s in iset.copy():         # For each other set:
            if nset.intersection(s):  # if they intersect:
                check = True          # Must recheck previous sets
                iset.remove(s)        # Remove it from remaining sets
                nset.update(s)        # Add it to the current set
    result.append(tuple(nset))        # Convert back to a list of tuples

Check Spark Dataframe row has ANY column meeting a condition and stop when first such column found

The following code can be used to filter rows that contain a value of 1. Imagine there are a lot of columns.
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StructType
val df = sc.parallelize(Seq(
("r1", 1, 1),
("r2", 6, 4),
("r3", 4, 1),
("r4", 1, 2)
)).toDF("ID", "a", "b")
val ones = df.schema.map(c => c.name).drop(1).map(x => when(col(x) === 1, 1).otherwise(0)).reduce(_ + _)
df.withColumn("ones", ones).where($"ones" === 0).show
The downside here is that it should ideally stop when the first such condition is met, i.e. at the first matching column found.
But I cannot find an elegant way to achieve this without presumably using a UDF or very specific logic; the map will process all columns.
Could a fold(Left) therefore be used that terminates when the first occurrence is found? Or some other approach? Maybe I am overlooking something.
My first idea was to use logical expressions and hope for short-circuiting, but it seems Spark does not do this:
df
.withColumn("ones", df.columns.tail.map(x => when(col(x) === 1, true)
.otherwise(false)).reduceLeft(_ or _))
.where(!$"ones")
.show()
But I'm not sure whether Spark supports short-circuiting; I think it does not (https://issues.apache.org/jira/browse/SPARK-18712).
So alternatively you can apply a custom function to your rows using the lazily evaluated exists on Scala's Seq:
df
.map{r => (r.getString(0),r.toSeq.tail.exists(c => c.asInstanceOf[Int]==1))}
.toDF("ID","ones")
.show()
This approach is similar to a UDF, so I'm not sure whether that is acceptable for you.
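Alternatively, if you are on Spark 3.0 or later, a similar check can be expressed with the built-in exists higher-order function over an array of the value columns, staying in the untyped DataFrame API. Whether the engine actually short-circuits within a row is not guaranteed; this is just a sketch:
import org.apache.spark.sql.functions.{array, col, exists}

// Build an array from every column except the ID, then check whether any
// element equals 1 (the exists function requires Spark 3.0+).
val hasOne = exists(array(df.columns.tail.map(col): _*), _ === 1)

df.withColumn("ones", hasOne)
  .where(!$"ones")
  .show()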

Using self-defined data transform function in Spark Structured Stream

I read the following blog and found the API very useful.
https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html
In the blog, there are lots of data selection examples, like using the input
{
"a": {
"b": 1
}
}
Applying events.select("a.b") in Scala, the output would be
{
"b": 1
}
But data type conversions are not mentioned in the blog. Say I have the following input:
{
"timestampInSec": "1514917353",
"ip": "123.39.76.112",
"money": "USD256",
"countInString": "6"
}
The expected output is:
{
"timestamp": "2018-01-02 11:22:33",
"ip_long": 2066173040,
"currency": "USD",
"money_amount": 256,
"count": 6
}
There are some transformations that are not included in org.apache.spark.sql.functions._:
The timestamp is in seconds and is a string type
Convert IP to long
Split USD256 into two columns and convert one of them to a number
Convert string to number
Another thing is error handling and default values. If there is invalid input like:
{
"timestampInSec": "N/A",
"money": "999",
"countInString": "Number-Six"
}
It is expected that the output will be
{
"timestamp": "1970-01-01 00:00:00",
"ip_long": 0,
"currency": "NA",
"money_amount": 999,
"count": -1
}
The input timestampInSec is not a number. It is expected to fall back to 0 and still produce a timestamp string as the return value.
ip is missing from the input. It is expected to use a default value of 0.
The money field is incomplete: it has the amount but is missing the currency. It is expected to use NA as the default currency and still parse money_amount correctly.
countInString is not a number. It is expected to use -1 (not 0) as the default value.
These requirements are not common and need some self-defined business logic.
I did check some functions like to_timestamp, but there is some codegen involved and it does not seem easy to add new functions. Is there a guide/document about writing self-defined transformation functions? Is there an easy way to meet these requirements?
For all of the following:
import org.apache.spark.sql.functions._
Timestamp is in seconds and is a string type
val timestamp = coalesce(
$"timestampInSec".cast("long").cast("timestamp"),
lit(0).cast("timestamp")
).alias("timestamp")
Split USD256 into two columns and convert one of them to a number
val currencyPattern = "^([A-Z]+)?([0-9]+)$"
val rawCurrency = trim(regexp_extract($"money", currencyPattern, 1))
val currency = when(length(rawCurrency) === 0, "NA")
  .otherwise(rawCurrency)
  .alias("currency")
val amount = regexp_extract($"money", currencyPattern, 2)
.cast("decimal(38, 0)").alias("money_amount")
Convert string to number
val count = coalesce($"countInString".cast("long"), lit(-1)).alias("count")
Convert IP to long
val ipPattern = "^([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})\\.([0-9]{1,3})"
val ip_long = coalesce(Seq((1, 24), (2, 16), (3, 8), (4, 0)).map {
case (group, numBits) => shiftLeft(
regexp_extract($"ip", ipPattern, group).cast("long"),
numBits
)
}.reduce(_ + _), lit(0)).alias("ip_long")
Result
val df = Seq(
("1514917353", "123.39.76.112", "USD256", "6"),
("N/A", null, "999", null)
).toDF("timestampInSec", "ip", "money", "countInString")
df.select(timestamp, currency, amount, count, ip_long).show
// +-------------------+--------+------------+-----+----------+
// | timestamp|currency|money_amount|count| ip_long|
// +-------------------+--------+------------+-----+----------+
// |2018-01-02 18:22:33| USD| 256| 6|2066173040|
// |1970-01-01 00:00:00| NA| 999| -1| 0|
// +-------------------+--------+------------+-----+----------+
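To tie this back to Structured Streaming: the column expressions above are ordinary Column values, so they can be plugged straight into a streaming query. Here is a rough sketch, assuming a Kafka source carrying the JSON documents in value; the bootstrap server and topic settings are placeholders:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// All raw fields arrive as strings; the parsing/defaulting logic lives in the
// column expressions defined above (timestamp, currency, amount, count, ip_long).
val schema = StructType(Seq(
  StructField("timestampInSec", StringType),
  StructField("ip", StringType),
  StructField("money", StringType),
  StructField("countInString", StringType)
))

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
  .option("subscribe", "events")                        // placeholder topic
  .load()
  .select(from_json($"value".cast("string"), schema).alias("e"))
  .select("e.*")

parsed
  .select(timestamp, currency, amount, count, ip_long)
  .writeStream
  .format("console")
  .start()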

Spark Dataframe groupBy and sort results into a list

I have a Spark Dataframe and I would like to group the elements by a key and have the results as a sorted list
Currently I am using:
df.groupBy("columnA").agg(collect_list("columnB"))
How do I make the items in the list sorted in ascending order?
You could try the function sort_array available in the functions package:
import org.apache.spark.sql.functions._
df.groupBy("columnA").agg(sort_array(collect_list("columnB")))
Just wanted to add another hint to the answer of Daniel de Paula regarding the sort_array solution.
If you want to sort elements according to a different column, you can form a struct of two fields:
the sort by field
the result field
Since structs are sorted field by field, you'll get the order you want; all you need is to get rid of the sort-by column in each element of the resulting list.
The same approach can be applied with several sort by columns when needed.
Here's an example that can be run in local spark-shell (use :paste mode):
import org.apache.spark.sql.Row
import spark.implicits._
case class Employee(name: String, department: String, salary: Double)
val employees = Seq(
Employee("JSMITH", "A", 20.0),
Employee("AJOHNSON", "A", 650.0),
Employee("CBAKER", "A", 650.2),
Employee("TGREEN", "A", 13.0),
Employee("CHORTON", "B", 111.0),
Employee("AIVANOV", "B", 233.0),
Employee("VSMIRNOV", "B", 11.0)
)
val employeesDF = spark.createDataFrame(employees)
val getNames = udf { salaryNames: Seq[Row] =>
salaryNames.map { case Row(_: Double, name: String) => name }
}
employeesDF
.groupBy($"department")
.agg(collect_list(struct($"salary", $"name")).as("salaryNames"))
.withColumn("namesSortedBySalary", getNames(sort_array($"salaryNames", asc = false)))
.show(truncate = false)
The result:
+----------+--------------------------------------------------------------------+----------------------------------+
|department|salaryNames |namesSortedBySalary |
+----------+--------------------------------------------------------------------+----------------------------------+
|B |[[111.0, CHORTON], [233.0, AIVANOV], [11.0, VSMIRNOV]] |[AIVANOV, CHORTON, VSMIRNOV] |
|A |[[20.0, JSMITH], [650.0, AJOHNSON], [650.2, CBAKER], [13.0, TGREEN]]|[CBAKER, AJOHNSON, JSMITH, TGREEN]|
+----------+--------------------------------------------------------------------+----------------------------------+
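As a side note, on Spark 2.4+ the same result can be sketched without a UDF by using the transform higher-order function to strip the sort-by field after sort_array (same employeesDF as above):
import org.apache.spark.sql.functions.{collect_list, expr, struct}

employeesDF
  .groupBy($"department")
  .agg(collect_list(struct($"salary", $"name")).as("salaryNames"))
  .withColumn(
    "namesSortedBySalary",
    // sort the structs descending by their first field (salary),
    // then keep only the name of each element
    expr("transform(sort_array(salaryNames, false), x -> x.name)")
  )
  .show(truncate = false)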

Can I use UDAFs in window functions?

I created a user-defined aggregate function. It concatenates all accumulated values into a list (ArrayType). It's called EdgeHistory.
If I don't specify the window it works fine. It returns an array of all the lists. But with the following example it fails:
case class ExampleRow(n: Int, list: List[(String, String, Float, Float)])
val x = Seq(
ExampleRow(1, List(("a", "b", 1f, 2f), ("c", "d", 2f, 3f))),
ExampleRow(2, List(("a", "b", 2f, 4f), ("c", "d", 4f, 6f))),
ExampleRow(3, List(("a", "b", 4f, 8f), ("c", "d", 8f, 12f)))
)
val df = sc.parallelize(x).toDF()
val edgeHistory = new EdgeHistory()
val y = df.agg(edgeHistory('list).over(Window.orderBy("n").rangeBetween(1, 0)))
It throws an error:
STDERR: Exception in thread "main" java.lang.UnsupportedOperationException: EdgeHistory('list) is not supported in a window operation.
at org.apache.spark.sql.expressions.WindowSpec.withAggregate(WindowSpec.scala:177)
at org.apache.spark.sql.Column.over(Column.scala:1052)
at szdavid92.AnalyzeGraphStream$.main(AnalyzeGraphStream.scala:75)
The error message seems pretty straightforward: it seems you cannot use UDAFs in window functions.
Do I understand correctly?
Why does this limitation exist?
UPDATE
I tried the SQL syntax and got a related error
df.registerTempTable("data")
sqlContext.udf.register("edge_history", edgeHistory)
val y = sqlContext.sql(
"""
|SELECT n, list, edge_history(list) OVER (ORDER BY n ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
|FROM data
""".stripMargin)
that is
Exception in thread "main" org.apache.spark.sql.AnalysisException: Couldn't find window function edge_history;
at org.apache.spark.sql.hive.ResolveHiveWindowFunction$$anonfun$apply$1$$anonfun$applyOrElse$1$$anonfun$3.apply(hiveUDFs.scala:288)
at org.apache.spark.sql.hive.ResolveHiveWindowFunction$$anonfun$apply$1$$anonfun$applyOrElse$1$$anonfun$3.apply(hiveUDFs.scala:288)
