Array Intersection in Spark SQL - apache-spark

I have a table with an array-type column named writer which has values like array[value1, value2], array[value2, value3], etc.
I am doing a self join to get results which have common values between the arrays. I tried:
sqlContext.sql("SELECT R2.writer FROM table R1 JOIN table R2 ON R1.id != R2.id WHERE ARRAY_INTERSECTION(R1.writer, R2.writer)[0] is not null ")
And
sqlContext.sql("SELECT R2.writer FROM table R1 JOIN table R2 ON R1.id != R2.id WHERE ARRAY_INTERSECT(R1.writer, R2.writer)[0] is not null ")
But got the same exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Undefined function: 'ARRAY_INTERSECT'. This function is neither a
registered temporary function nor a permanent function registered in
the database 'default'.; line 1 pos 80
Probably Spark SQL does not support ARRAY_INTERSECTION and ARRAY_INTERSECT. How can I achieve my goal in Spark SQL?

Since Spark 2.4, the array_intersect function can be used directly in SQL:
spark.sql(
"SELECT array_intersect(array(1, 42), array(42, 3)) AS intersection"
).show()
+------------+
|intersection|
+------------+
| [42]|
+------------+
and in the Dataset API:
import org.apache.spark.sql.functions.array_intersect
Seq((Seq(1, 42), Seq(42, 3)))
.toDF("a", "b")
.select(array_intersect($"a", $"b") as "intersection")
.show()
+------------+
|intersection|
+------------+
| [42]|
+------------+
Equivalent functions are also available in other languages:
pyspark.sql.functions.array_intersect in PySpark.
SparkR::array_intersect in SparkR.

You'll need a UDF:
import org.apache.spark.sql.functions.udf

spark.udf.register("array_intersect",
  (xs: Seq[String], ys: Seq[String]) => xs.intersect(ys))
and then check whether the intersection is empty:
scala> spark.sql("SELECT size(array_intersect(array('1', '2'), array('3', '4'))) = 0").show
+-----------------------------------------+
|(size(UDF(array(1, 2), array(3, 4))) = 0)|
+-----------------------------------------+
| true|
+-----------------------------------------+
scala> spark.sql("SELECT size(array_intersect(array('1', '2'), array('1', '4'))) = 0").show
+-----------------------------------------+
|(size(UDF(array(1, 2), array(1, 4))) = 0)|
+-----------------------------------------+
| false|
+-----------------------------------------+
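Putting the UDF to use in the original query, a minimal sketch (using the table, id and writer names from the question) that keeps only the rows whose writer arrays overlap:
// Hedged sketch: reuse the registered array_intersect UDF in the original
// self-join and keep only pairs with a non-empty intersection.
spark.sql("""
  SELECT R2.writer
  FROM table R1 JOIN table R2 ON R1.id != R2.id
  WHERE size(array_intersect(R1.writer, R2.writer)) > 0
""").show()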

Related

Merging rows to map type based on max value in a column

I am doing a small POC to ingest user events (a CSV file) from a website. Below is the sample input:
Input Schema:
Output should be in the format as below
The logic required is to group by the id column and merge the
name and value columns into a Map type, where the name column provides the key
and the value column provides the value in the Map. The value to be picked for each key in the Map is the one with the highest value in the timestamp column.
I was able to achieve the part where the data is grouped by id and the maximum of the timestamp column is extracted. I am facing difficulty with selecting the one value (from the corresponding max timestamp) for each id and merging it with the other names (using a map).
Below is my code:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types._

val schema = StructType(List(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("value", StringType, nullable = true),
  StructField("timestamp", LongType, nullable = true)))

val myDF = spark.read.schema(schema).option("header", "true").option("delimiter", ",")
  .csv("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/tru.csv")
val df = myDF.toDF("id", "name", "value", "timestamp")

//df.groupBy("id","name","value").agg(max("timestamp")).show()
val windowSpecAgg = Window.partitionBy("id")
df.withColumn("max", max(col("timestamp")).over(windowSpecAgg))
  .where(col("timestamp") === col("max"))
  .drop("max")
  .show()
Use a window function and filter out the latest data by partitioning on "id","name", then use the map_from_arrays and to_json functions to recreate the desired JSON.
Example:
df.show()
//sample data
//+---+----+-------+---------+
//| id|name| value|timestamp|
//+---+----+-------+---------+
//| 1| A| Exited| 3201|
//| 1| A|Running| 5648|
//| 1| C| Exited| 3547|
//| 2| C|Success| 3612|
//+---+----+-------+---------+
val windowSpecAgg = Window.partitionBy("id","name").orderBy(desc("timestamp"))
df.withColumn("max", row_number().over(windowSpecAgg)).filter(col("max")===1).
drop("max").
groupBy("id").
agg(to_json(map_from_arrays(collect_list(col("name")),collect_list(col("value")))).as("settings")).
show(10,false)
//+---+----------------------------+
//|id |settings |
//+---+----------------------------+
//|1 |{"A":"Running","C":"Exited"}|
//|2 |{"C":"Success"} |
//+---+----------------------------+
You can use the ranking function row_number() to get the latest records per partition.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._

val df = Seq((1, "A", "Exited", 1546333201),
  (3, "B", "Failed", 1546334201),
  (2, "C", "Success", 1546333612),
  (3, "B", "Hold", 1546333444),
  (1, "A", "Running", 1546335648),
  (1, "C", "Exited", 1546333547)).toDF("id", "name", "value", "timestamp")

df.withColumn("rn",
    row_number().over(Window.partitionBy("id", "name").orderBy('timestamp.desc_nulls_last)))
  .where('rn === 1)
  .drop("rn")
  .groupBy("id")
  .agg(collect_list(map('name, 'value)).as("settings"))
  .show(false)
/*
+---+-------------------------------+
|id |settings |
+---+-------------------------------+
|1 |[[A -> Running], [C -> Exited]]|
|3 |[[B -> Failed]] |
|2 |[[C -> Success]] |
+---+-------------------------------+ */
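If a single map per id is preferred over the list of single-entry maps shown above, a hedged variant (Spark 2.4+, not part of the original answer) collects (name, value) structs and builds one map with map_from_entries:
// Hedged variant (Spark 2.4+): one map per id instead of a list of maps,
// matching the JSON shape produced in the previous answer.
df.withColumn("rn",
    row_number().over(Window.partitionBy("id", "name").orderBy('timestamp.desc_nulls_last)))
  .where('rn === 1)
  .groupBy("id")
  .agg(map_from_entries(collect_list(struct('name, 'value))).as("settings"))
  .show(false)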

How to get the key of a single dataframe in spark joins

Suppose I have 2 datasets like below:
book
case class Book(book_name: String, cost: Int, writer_id:Int)
val bookDS = Seq(
  Book("Scala", 400, 1),
  Book("Spark", 500, 2),
  Book("Kafka", 300, 3),
  Book("Java", 350, 5)
).toDS()
bookDS.show()
Writer
case class Writer(writer_name: String, writer_id:Int)
val writerDS = Seq(
  Writer("Martin", 1),
  Writer("Zaharia", 2),
  Writer("Neha", 3),
  Writer("James", 4)
).toDS()
writerDS.show()
When I inner join them, writer_id is returned twice.
How can I get the writer_id of only one dataset?
I don't want to write SQL like select a.something, b.something.
writerDS.join(bookDS, Seq("writer_id")).show()
Output:
+---------+-----------+---------+----+
|writer_id|writer_name|book_name|cost|
+---------+-----------+---------+----+
| 1| Martin| Scala| 400|
| 2| Zaharia| Spark| 500|
| 3| Neha| Kafka| 300|
+---------+-----------+---------+----+
When we join two datasets, all columns from both datasets will be present in the result dataset.
So you can rename the column and then drop one of the two.
Dataset<Row> joinedDataset = bookDS
    .withColumnRenamed("writer_id", "book_writer_id")
    .join(writerDS, new Column("book_writer_id").equalTo(new Column("writer_id")), "inner")
    .drop("book_writer_id");
Not sure if you are using Python or Scala.
This is Java code, so please convert it accordingly.
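For reference, a rough Scala equivalent of the Java snippet above (a sketch against the bookDS and writerDS defined in the question):
import org.apache.spark.sql.functions.col

// Rough Scala sketch of the same rename-join-drop approach.
bookDS
  .withColumnRenamed("writer_id", "book_writer_id")
  .join(writerDS, col("book_writer_id") === col("writer_id"), "inner")
  .drop("book_writer_id")
  .show()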

Getting the table name from a Spark Dataframe

If I have a dataframe created as follows:
df = spark.table("tblName")
Is there any way that I can get back tblName from df?
You can extract it from the plan:
df.logicalPlan().argString().replace("`","")
We can extract the table name from a DataFrame by parsing its unresolved logical plan.
Please use the method below:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.catalog.{CatalogTable, HiveTableRelation}
import org.apache.spark.sql.execution.datasources.LogicalRelation

def getTableName(df: DataFrame): String = {
  Seq(df.queryExecution.logical, df.queryExecution.optimizedPlan).flatMap {
    _.collect {
      case LogicalRelation(_, _, catalogTable: Option[CatalogTable], _) =>
        if (catalogTable.isDefined) Some(catalogTable.get.identifier.toString()) else None
      case hive: HiveTableRelation =>
        Some(hive.tableMeta.identifier.toString())
    }
  }.flatten.head
}
scala> val df = spark.table("db.table")
scala> getTableName(df)
res: String = `db`.`table`
The following utility function may be helpful to determine the table name from a given DataFrame.
import re
import typing

import pyspark.sql


def get_dataframe_tablename(df: pyspark.sql.DataFrame) -> typing.Optional[str]:
    """
    If the dataframe was created from an underlying table (e.g. spark.table('dual') or
    spark.sql("select * from dual")), this function will return the
    fully qualified table name (e.g. `default`.`dual`); otherwise it will return None.
    Tested on: Python 3.7, Spark 3.0.1, but it should work with Spark >= 2.x and Python >= 3.4 too.
    Examples:
    >>> get_dataframe_tablename(spark.table('dual'))
    `default`.`dual`
    >>> get_dataframe_tablename(spark.sql("select * from dual"))
    `default`.`dual`
    It inspects the output of `df.explain()` to determine whether the df was created from a table or not.
    :param df: input dataframe whose underlying table name will be returned
    :return: table name or None
    """
    def _explain(_df: pyspark.sql.DataFrame) -> str:
        # df.explain() does not take a parameter to capture the output;
        # it dumps the plan to stdout by default
        import contextlib
        import io
        with contextlib.redirect_stdout(io.StringIO()) as f:
            _df.explain()
        f.seek(0)  # Rewind stream position
        explanation = f.readlines()[1]  # Ignore first output line (== Physical Plan ==)
        return explanation

    pattern = re.compile("Scan hive (.+), HiveTableRelation (.+?), (.+)")
    output = _explain(df)
    match = pattern.search(output)
    return match.group(2) if match else None
The three lines of code below will give the table and database name:
import org.apache.spark.sql.execution.FileSourceScanExec

val df = session.table("dealer")
df.queryExecution.sparkPlan.asInstanceOf[FileSourceScanExec].tableIdentifier
Any answer on this one yet? I found a way, but it's probably not the prettiest. You can access the table name by retrieving the physical execution plan and then doing some string-splitting magic on it.
Let's say you have a table from database_name.tblName. The following should work:
execution_plan = df._jdf.queryExecution().simpleString()
table_name = execution_plan.split('FileScan')[1].split('[')[0].split('.')[1]
The first line will return your execution plan in a string format. That will look similar to this:
== Physical Plan ==\n*(1) ColumnarToRow\n+- FileScan parquet database_name.tblName[column1#2880,column2ban#2881] Batched: true, DataFilters: [], Format: Parquet, Location: PreparedDeltaFileIndex[dbfs:/mnt/lake/database_name/table_name], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<column1:string,column2:string...\n\n'
After that you can run some string splitting to access the relevant information. Splitting on 'FileScan' gives you the parts of the plan around it; the second element is the one of interest. Splitting that on '[' and keeping the first element, then splitting on '.' and taking the second element, returns tblName.
You can create a table from df. But if the table is a local temporary view or a global temporary view, you should drop it (sqlContext.dropTempTable) before creating one with the same name, or use the create-or-replace variants (df.createOrReplaceTempView or df.createOrReplaceGlobalTempView); a short sketch of that follows the example below. If the table was registered with registerTempTable, you can register one with the same name again without an error.
#Create data frame
>>> d = [('Alice', 1)]
>>> test_df = spark.createDataFrame(sc.parallelize(d), ['name','age'])
>>> test_df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
#create tables
>>> test_df.createTempView("tbl1")
>>> test_df.registerTempTable("tbl2")
>>> sqlContext.tables().show()
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| | tbl1| true|
| | tbl2| true|
+--------+---------+-----------+
#create data frame from tbl1
>>> df = spark.table("tbl1")
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
#create tbl1 again using the df data frame. It will raise an error
>>> df.createTempView("tbl1")
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Temporary view 'tbl1' already exists;"
#drop and create again
>>> sqlContext.dropTempTable('tbl1')
>>> df.createTempView("tbl1")
>>> spark.sql('select * from tbl1').show()
+-----+---+
| name|age|
+-----+---+
|Alice| 1|
+-----+---+
#create data frame from tbl2 and replace name value
>>> df = spark.table("tbl2")
>>> df = df.replace('Alice', 'Bob')
>>> df.show()
+----+---+
|name|age|
+----+---+
| Bob| 1|
+----+---+
#create tbl2 again using the df data frame
>>> df.registerTempTable("tbl2")
>>> spark.sql('select * from tbl2').show()
+----+---+
|name|age|
+----+---+
| Bob| 1|
+----+---+
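As mentioned above, the create-or-replace variant avoids the drop step entirely. A minimal sketch (Scala shown here; the method has the same name on PySpark DataFrames):
// Minimal sketch: replace the temporary view in place instead of dropping it first.
df.createOrReplaceTempView("tbl1")
spark.sql("select * from tbl1").show()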

Spark HiveContext get the same format as hive client select

When a Hive table has values like maps or arrays, if you select it in the Hive client they are shown as JSON, e.g.: {"a":1,"b":1} or [1,2,2].
When you select those in Spark, they are map/array objects in the DataFrame. If you stringify each row they are Map("a" -> 1, "b" -> 1) or WrappedArray(1, 2, 2).
I want to have the same format as the Hive client when using Spark's HiveContext.
How can I do this?
Spark has its own functions to convert complex objects into their JSON representation.
The documentation for the org.apache.spark.sql.functions package describes the to_json function as follows:
Converts a column containing a StructType, ArrayType of StructTypes, a MapType or ArrayType of MapTypes into a JSON string with the specified schema. Throws an exception, in the case of an unsupported type.
Here is a short example as run in the spark-shell:
scala> val df = spark.createDataFrame(
| Seq(("hello", Map("a" -> 1)), ("world", Map("b" -> 2)))
| ).toDF("name", "map")
df: org.apache.spark.sql.DataFrame = [name: string, map: map<string,int>]
scala> df.show
+-----+-----------+
| name| map|
+-----+-----------+
|hello|Map(a -> 1)|
|world|Map(b -> 2)|
+-----+-----------+
scala> df.select($"name", to_json(struct($"map")) as "json").show
+-----+---------------+
| name| json|
+-----+---------------+
|hello|{"map":{"a":1}}|
|world|{"map":{"b":2}}|
+-----+---------------+
Here is a similar example, with arrays instead of maps:
scala> val df = spark.createDataFrame(
| Seq(("hello", Seq("a", "b")), ("world", Seq("c", "d")))
| ).toDF("name", "array")
df: org.apache.spark.sql.DataFrame = [name: string, array: array<string>]
scala> df.select($"name", to_json(struct($"array")) as "json").show
+-----+-------------------+
| name| json|
+-----+-------------------+
|hello|{"array":["a","b"]}|
|world|{"array":["c","d"]}|
+-----+-------------------+
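In newer Spark versions, to_json can also be applied to the map column directly (a hedged note, not from the original answer), which drops the extra wrapping key introduced by struct. Using the first df above (the one with the map column) in the spark-shell:
import org.apache.spark.sql.functions.to_json

// Hedged: to_json on the map column itself, no surrounding {"map": ...} wrapper.
// Expected output (approximately): {"a":1} and {"b":2}.
df.select($"name", to_json($"map") as "json").show()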

SparkSQL — collect_set and sort_array does not sort integer column properly

I want to generate a sorted, collected set in SparkSQL, like so:
spark.sql("SELECT id, col_2, sort_array(collect_set(value)) AS collected
FROM my_table GROUP BY id, col_2").show()
where value is an integer.
But it fails to sort the array in proper numeric order and instead does something rather ad hoc (sorting on the leading digits of the value? Is sort_array operating on strings?).
So instead of:
+----+-------+------------+
| id | col_2 | collected |
+----+-------+------------+
| 1 | 2 | [456,1234]|
+----+-------+------------+
I get:
+----+-------+------------+
| id | col_2 | collected |
+----+-------+------------+
| 1 | 2 | [1234,456]|
+----+-------+------------+
EDIT:
Looking at what spark.sql(…) returns, it is obvious that this query returns strings instead:
DataFrame[id: string, col_2: string, collected: array<string>]
How can that be when the original dataframe is all integers?
EDIT 2:
This seems to be a problem related to PySpark, as I'm not experiencing the problem when writing the same thing in Scala in the spark-shell.
I tested with Apache Spark 2.0.0 and it works for me. To make sure, I tested with the data [(1, 2, 1234), (1, 2, 456)] and [(1, 2, 456), (1, 2, 1234)]. The result is the same.
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([(1, 2, 1234), (1, 2, 456)], ['id', 'col_2', 'value'])
# test with reversed order, too
#df = sqlContext.createDataFrame([(1, 2, 456), (1, 2, 1234)], ['id', 'col_2', 'value'])
df.createOrReplaceTempView("my_table")
sqlContext.sql("SELECT id, col_2, sort_array(collect_set(value)) AS collected FROM my_table GROUP BY id, col_2").show()
Result
+---+-----+-----------+
| id|col_2| collected|
+---+-----+-----------+
| 1| 2|[456, 1234]|
+---+-----+-----------+
Some observations:
when a value is None it appears as null, e.g. [null, 456, 1234]
when there is a string value, Spark throws the error "TypeError: Can not merge type LongType and StringType"
I think the problem is not in the SQL but in the earlier steps where the DataFrame was created.
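To illustrate that last point, a hedged spark-shell sketch (not from the original answer): if value was ingested as a string, sort_array sorts lexicographically, and casting it back to an integer restores numeric order:
// Hedged illustration: a string-typed value column sorts lexicographically
// ("1234" < "456"); casting to int restores numeric ordering.
import org.apache.spark.sql.functions._
import spark.implicits._

val strDF = Seq((1, 2, "1234"), (1, 2, "456")).toDF("id", "col_2", "value")

strDF.groupBy("id", "col_2")
  .agg(sort_array(collect_set($"value")).as("collected"))
  .show()  // collected = [1234, 456]  (string ordering)

strDF.withColumn("value", $"value".cast("int"))
  .groupBy("id", "col_2")
  .agg(sort_array(collect_set($"value")).as("collected"))
  .show()  // collected = [456, 1234]  (numeric ordering)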
