The question "Does SparkSQL support subquery?" states that subqueries are not currently supported in Spark 2.0.
Has this changed recently?
Your comment is correct, although your question is a little vague. I take your point, though, and the concepts are a fair subject for this sort of question, so here goes.
This is now possible for the DataFrame API, but not for the Dataset API or the DSL, as you state. For example:
SELECT A.dep_id,
A.employee_id,
A.age,
(SELECT MAX(age)
FROM employee B
WHERE A.dep_id = B.dep_id) max_age
FROM employee A
ORDER BY 1,2
An example borrowed from the Internet clearly shows the distinction between DS and DF. By deduction, it implies that a Spark SQL correlated subquery (not shown here, of course) does not run against a Dataset either:
sql("SELECT COUNT(*) FROM src").show()
val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")
val stringsDS = sqlDF.map {case Row(key: Int, value: String) => s"Key: $key, Value: $value"}
stringsDS.show()
The SQL runs against a source such as Hive or Parquet, or against Spark temp views, not against a DS. From a DF you can go to a DS and then enjoy the more type-safe approach, but only with the limited interface on select. I searched thoroughly for something that disproves this and found nothing. DS and DF are more or less interchangeable anyway, as I think I have stated to you earlier. But I see you are very thorough!
Moreover, there are at least two techniques for converting nested correlated subqueries into "normal" JOINs, which is what Spark and indeed other optimizers do in the background, e.g. RewriteCorrelatedScalarSubquery and PullupCorrelatedPredicates.
For a DSL, which you allude to, you can rewrite the query by hand to achieve the same result using JOIN, LEFT JOIN, OUTER JOIN, or whatever the case requires. Oddly enough, that rewrite is not obvious to everyone.
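To make the hand rewrite concrete, here is a minimal sketch. It uses Python's built-in sqlite3 rather than Spark, purely so it runs anywhere, and the employee rows are made up for illustration. It runs the correlated scalar subquery from above and an equivalent LEFT JOIN against a grouped aggregate, then compares the results.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark source (assumption:
# the rewrite is illustrated on SQLite, not executed by Spark).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (dep_id INTEGER, employee_id INTEGER, age INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [(1, 10, 30), (1, 11, 45), (2, 20, 28), (2, 21, 52), (2, 22, 39)],  # made-up rows
)

# Original form: correlated scalar subquery evaluated per outer row.
correlated = """
SELECT A.dep_id, A.employee_id, A.age,
       (SELECT MAX(age) FROM employee B WHERE A.dep_id = B.dep_id) AS max_age
FROM employee A
ORDER BY 1, 2
"""

# Hand rewrite: aggregate once per dep_id, then LEFT JOIN on the correlation key.
joined = """
SELECT A.dep_id, A.employee_id, A.age, M.max_age
FROM employee A
LEFT JOIN (SELECT dep_id, MAX(age) AS max_age
           FROM employee
           GROUP BY dep_id) M
       ON A.dep_id = M.dep_id
ORDER BY 1, 2
"""

correlated_rows = conn.execute(correlated).fetchall()
joined_rows = conn.execute(joined).fetchall()
print(correlated_rows == joined_rows)
```

The LEFT JOIN (rather than an inner join) keeps outer rows whose subquery would return NULL, which is why it is the safer default when rewriting a scalar subquery by hand.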
Related
I have inherited a slow query in Informix. I suspect part of the slowness is due to the use of subqueries to do left outer joins. Here is a sample of the code:
FROM intide_rec AS IDE
LEFT OUTER JOIN (SELECT idp_cmpy_id, idp_idc_ctl_no, idp_itm_ctl_no, idp_brh, idp_invt_typ, idp_frm, idp_grd, idp_size, idp_fnsh, idp_whs, idp_mill, idp_heat, idp_tag_no, idp_num_size1, idp_num_size2, idp_num_size3, idp_num_size4, idp_num_size5, idp_wdth, idp_lgth, idp_idia, idp_odia, idp_ga_size, idp_ohd_mat_val, idp_ohd_pcs, idp_ohd_wgt, idp_invt_sts, idp_invt_qlty, idp_bgt_for, idp_ownr_id FROM intidp_rec) AS IDP ON (IDE.ide_cmpy_id = IDP.idp_cmpy_id AND IDE.ide_idc_ctl_no = IDP.idp_idc_ctl_no)
LEFT OUTER JOIN (SELECT prm_pep, prm_frm, prm_grd, prm_size, prm_fnsh FROM inrprm_rec) AS PRM ON
(IDP.idp_frm = PRM.prm_frm AND IDP.idp_grd = PRM.prm_grd AND IDP.idp_size = PRM.prm_size AND IDP.idp_fnsh = PRM.prm_fnsh)
Notice that the subqueries are simply retrieving columns. There is no manipulation of the columns. What is odd to me is why there are SELECT statements, i.e. subqueries, here.
Why not just remove the subqueries, move the columns out of the subqueries and into the main SELECT statement since there is no manipulation of columns and write the joins like this:
FROM intide_rec AS IDE
LEFT OUTER JOIN intidp_rec AS IDP ON (IDE.ide_cmpy_id = IDP.idp_cmpy_id AND IDE.ide_idc_ctl_no = IDP.idp_idc_ctl_no)
LEFT OUTER JOIN inrprm_rec AS PRM ON (IDP.idp_frm = PRM.prm_frm AND IDP.idp_grd = PRM.prm_grd AND IDP.idp_size = PRM.prm_size AND IDP.idp_fnsh = PRM.prm_fnsh)
What are your thoughts on the original code and subqueries vs the way I have rewritten the code? Is it inefficient from a performance perspective? Or is it acceptable from a performance perspective?
Thanks for any thoughts.
One way to provide some answer is to analyze the output from SET EXPLAIN ON for the two queries. Ideally, there shouldn't be a difference between the query plans. If the query plans are demonstrably 'the same' or 'equivalent', then the optimizer is doing its stuff well. Determining that they are equivalent may be harder than either of us would like. However, if there is a major difference in the query plans, the subqueries probably are slower and your rewrite should be at least as fast as the original and probably faster. Also, remember that query plans are only indicative of what the optimizer thinks will happen — time the different queries on production data as well.
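As a rough illustration of that comparison, the sketch below uses Python's built-in sqlite3 as a stand-in, since I cannot run Informix here. The tables are cut down to a few columns and the rows are invented. It checks that the derived-table form and the direct join return the same rows, then prints SQLite's plan for each (EXPLAIN QUERY PLAN is SQLite's counterpart, not Informix's SET EXPLAIN ON output).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down, hypothetical versions of the two tables with invented rows.
conn.executescript("""
CREATE TABLE intide_rec (ide_cmpy_id TEXT, ide_idc_ctl_no INTEGER);
CREATE TABLE intidp_rec (idp_cmpy_id TEXT, idp_idc_ctl_no INTEGER, idp_frm TEXT);
INSERT INTO intide_rec VALUES ('A', 1), ('A', 2), ('B', 3);
INSERT INTO intidp_rec VALUES ('A', 1, 'bar'), ('A', 1, 'rod'), ('B', 3, 'tube');
""")

# Original style: join against a subquery that only projects columns.
with_subquery = """
SELECT IDE.ide_cmpy_id, IDE.ide_idc_ctl_no, IDP.idp_frm
FROM intide_rec AS IDE
LEFT OUTER JOIN (SELECT idp_cmpy_id, idp_idc_ctl_no, idp_frm FROM intidp_rec) AS IDP
  ON (IDE.ide_cmpy_id = IDP.idp_cmpy_id AND IDE.ide_idc_ctl_no = IDP.idp_idc_ctl_no)
ORDER BY 1, 2, 3
"""

# Rewritten style: join the base table directly.
direct_join = """
SELECT IDE.ide_cmpy_id, IDE.ide_idc_ctl_no, IDP.idp_frm
FROM intide_rec AS IDE
LEFT OUTER JOIN intidp_rec AS IDP
  ON (IDE.ide_cmpy_id = IDP.idp_cmpy_id AND IDE.ide_idc_ctl_no = IDP.idp_idc_ctl_no)
ORDER BY 1, 2, 3
"""

rows_subquery = conn.execute(with_subquery).fetchall()
rows_direct = conn.execute(direct_join).fetchall()
same_result = rows_subquery == rows_direct

# Print the plan the engine reports for each form so they can be compared by eye.
for label, query in (("subquery", with_subquery), ("direct", direct_join)):
    print(label)
    for step in conn.execute("EXPLAIN QUERY PLAN " + query):
        print("  ", step)
```

Identical results only show the rewrite is safe; whether the plans and timings match on Informix still has to be checked there.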
You don't mention which version of Informix you're using or which platform you're using it on. It probably doesn't matter and it must be a relatively recent version to support the LEFT OUTER JOIN notation (this millennium rather than the last, at any rate). However, it is beneficial to state that. Note that only versions 12.10 and 14.10 are under support unless you've made special arrangements with IBM or HCL.
I am working with legacy Spark SQL code like this:
SELECT
column1,
max(column2),
first_value(column3),
last_value(column4)
FROM
tableA
GROUP BY
column1
ORDER BY
columnN
I am rewriting it in PySpark as below
df.groupBy(column1).agg(max(column2), first(column3), last(column4)).orderBy(columnN)
When I'm comparing the two outcomes I can see differences in the fields generated by the first_value/first and last_value/last functions.
Are they behaving in a non-deterministic way when used outside of Window functions?
Can groupBy aggregates be combined with Window functions?
This behaviour is possible when you have a wide table and you don't specify an ordering for the remaining columns. Under the hood, Spark takes as first() or last() whichever matching row it happens to encounter first or last within the group. Spark SQL and PySpark might pick different rows because no ordering is specified for the remaining columns.
In terms of Window functions, you can use partitionBy(f.col('column_name')) in your Window, which works somewhat like a groupBy: it groups the data according to a partitioning column. However, without specifying an ordering that covers all columns, you might run into the same non-determinism. Hope this helps!
For completeness' sake, I recommend you have a look at the PySpark doc for the first() and last() functions here: https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.functions.first
In particular, the following note sheds light on why your behaviour was non-deterministic:
Note The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle.
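The effect can be reproduced without Spark. The sketch below is plain Python with made-up rows. The helper first_per_group is hypothetical and only mimics "take the first row seen per key". It shows the result changing with arrival order and becoming stable once an explicit ordering is applied.

```python
def first_per_group(rows):
    """Return {key: first value seen}, mimicking first() with no ordering guarantee."""
    result = {}
    for key, value in rows:
        result.setdefault(key, value)  # keep only the first value seen for each key
    return result

# The same four rows in two different arrival orders (as might happen after a shuffle).
rows = [("a", 1), ("a", 2), ("b", 3), ("b", 4)]
shuffled = [("a", 2), ("b", 4), ("a", 1), ("b", 3)]

# Without an ordering, the answer depends on arrival order.
unordered_a = first_per_group(rows)
unordered_b = first_per_group(shuffled)

# With an explicit, total ordering the answer is the same for both inputs.
ordered_a = first_per_group(sorted(rows))
ordered_b = first_per_group(sorted(shuffled))

print(unordered_a, unordered_b, ordered_a, ordered_b)
```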
Definitely!
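import pyspark.sql.functions as F
from pyspark.sql import Window

# Order rows within each partition and span the whole partition. With orderBy alone,
# the default frame ends at the current row, so last() would not see later rows.
partition = (Window.partitionBy("column1").orderBy("columnN")
             .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
data = data.withColumn("max_col2", F.max(F.col("column2")).over(partition))\
    .withColumn("first_col3", F.first(F.col("column3")).over(partition))\
    .withColumn("last_col4", F.last(F.col("column4")).over(partition))
data.show(10, False)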
val df1 = Seq(("Brian", 29, "0-A-1234")).toDF("name", "age", "client-ID")
val df2 = Seq(("1234", "555-5555", "1234 anystreet")).toDF("office-ID", "BusinessNumber", "Address")
I'm trying to run a function on each row of a dataframe (in streaming). This function will contain a combination of Scala code and Spark DataFrame API code. For example, I want to take the 3 features from df and use them to filter a second dataframe called df2. My understanding is that a UDF can't accomplish this. I have all the filtering code working just fine, but no way to apply it to each row of df.
My goal is to be able to do something like
df.select("ID", "preferences").map(row => (
  // filter df2 using row(0), row(1) and row(3)
))
The dataframes can't be joined, there is not a joinable relationship between them.
Although I'm using Scala, an answer in Java or Python would probably be fine.
I'm also fine with alternative ways of accomplishing this. If I could extract the data from the rows into separate variables (keep in mind this is streaming), that's also fine.
My understanding is that a UDF can't accomplish this.
That is correct, but neither can map (local Datasets seem to be an exception; see "Why does this Spark code make NullPointerException?"). Nested logic like this can be expressed only using joins:
If both Datasets are streaming it has to be equijoin. It means that even though:
The dataframes can't be joined, there is not a joinable relationship between them.
you have to derive one in some way that approximates the filter condition well.
If one Dataset is not streaming, you can brute force things with crossJoin followed by filter, but it is of course hardly recommended.
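To show what "cross join followed by filter" means, here is a sketch in plain Python, not Spark. The rows, the column names other than ID and preferences, and the predicate are all invented for illustration. In Spark the same shape would be a crossJoin on the static Dataset followed by a filter.

```python
from itertools import product

# Hypothetical rows standing in for df (ID, preferences) and the static df2.
df_rows = [{"ID": 1, "preferences": "north"}, {"ID": 2, "preferences": "south"}]
df2_rows = [
    {"office": "x", "region": "north", "min_id": 0},
    {"office": "y", "region": "south", "min_id": 5},
    {"office": "z", "region": "north", "min_id": 1},
]

def matches(left, right):
    # Arbitrary predicate mixing equality and a range check, i.e. not a plain equijoin.
    return right["region"] == left["preferences"] and left["ID"] >= right["min_id"]

# Cross join: every pairing of a df row with a df2 row, then keep the pairs that match.
pairs = [
    (left["ID"], right["office"])
    for left, right in product(df_rows, df2_rows)
    if matches(left, right)
]
print(pairs)
```

The cost is the obvious one: every row is compared with every other row, which is why this only makes sense when the static side is small.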
Context
I have an example of event source data in a dataframe input as shown below.
SOURCE
where eventOccurredTime is a String type. It comes from the source, and I want to retain it in its original string form (with nanosecond precision).
I also want to use the string to derive extra date/time-typed fields for downstream use. Below is an example:
TARGET
Now as a one off I can execute some spark sql on the dataframe as shown below to get the result I want:
import org.apache.spark.sql.DataFrame
def transformDF(): DataFrame = {
spark.sql(
s"""
SELECT
id,
struct(
event.eventCategory,
event.eventName,
event.eventOccurredTime,
struct (
CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:mm:ss.SSS") AS TIMESTAMP) AS eventOccurredTimestampUTC,
CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:mm:ss.SSS") AS DATE) AS eventOccurredDateUTC,
unix_timestamp(substring(event.eventOccurredTime,1,23),"yyyy-MM-dd'T'HH:mm:ss.SSS") * 1000 AS eventOccurredTimestampMillis,
datesDim.dateSeq AS eventOccurredDateDimSeq
) AS eventOccurredTimeDim,
NOTE: This is a snippet. For the full event, I have to repeat this explicitly 20 times in one long SQL statement, once for each of the 20 string dates.
Some things to point out:
unix_timestamp(substring(event.eventOccurredTime,1,23)
In the line above, I found that a date with nanosecond precision returned null unless I truncated it first, hence the substring.
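The same truncation idea can be sketched in plain Python as an analogy only: Python's strptime is not Spark's parser, and the sample value is invented. The nine fractional digits do not fit the pattern, while the first 23 characters (millisecond precision) parse cleanly.

```python
from datetime import datetime, timedelta, timezone

raw = "2019-01-01T12:34:56.123456789"  # hypothetical value with nanosecond precision
fmt = "%Y-%m-%dT%H:%M:%S.%f"           # %f accepts at most 6 fractional digits

try:
    datetime.strptime(raw, fmt)
    parsed_full = True
except ValueError:
    parsed_full = False  # nine fractional digits leave unparsed characters behind

# Same idea as substring(event.eventOccurredTime, 1, 23): keep up to the milliseconds.
truncated = raw[:23]
ts = datetime.strptime(truncated, fmt).replace(tzinfo=timezone.utc)

# Integer epoch milliseconds, computed without floating-point rounding.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
millis = (ts - epoch) // timedelta(milliseconds=1)
print(parsed_full, truncated, millis)
```

One thing worth checking on the Spark side: unix_timestamp returns whole seconds, so multiplying its result by 1000 gives a millisecond-scaled value without the actual millisecond part.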
xDim.xTimestampUTC
xDim.xDateUTC
xDim.xTimestampMillis
xDim.xDateDimSeq
The above is the pattern and naming convention for the four nested xDim struct fields to derive. They are present in the predefined Spark schema that is used to read the JSON and create the source dataframe.
datesDim.dateSeq AS eventOccurredDateDimSeq
To get the 'eventOccurredDateDimSeq' field above, I need to join to a dates dimension table, 'datesDim' (static, with an hourly grain). dateSeq is the key of the hourly bucket the date falls into, and datesDim.UTC is defined to the hour:
LEFT OUTER JOIN datesDim ON
CAST(date_format(event.eventOccurredTime,"yyyy-MM-dd'T'HH:00:00") AS TIMESTAMP) = datesDim.UTC
The table is globally available in the spark cluster so should be quick to look up, but I need to do this for every date enrichment in the payloads and they will have different dates.
dateDimensionDF.write.mode("overwrite").saveAsTable("datesDim")
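Conceptually, the hourly lookup amounts to truncating the string to the hour and using that as a key. The sketch below is plain Python with a dict standing in for datesDim. The dateSeq values and the helper name date_dim_seq are made up. In Spark, a small static table like this is often a candidate for a broadcast join, though whether that helps here is an assumption to verify.

```python
from datetime import datetime

# Hypothetical stand-in for the static datesDim table: hour (UTC) -> dateSeq.
dates_dim = {
    datetime(2019, 1, 1, 12): 1001,
    datetime(2019, 1, 1, 13): 1002,
}

def date_dim_seq(event_time):
    """Truncate a timestamp string to the hour and look up its dateSeq."""
    # The first 13 characters are 'yyyy-MM-ddTHH', i.e. the hourly bucket.
    hour = datetime.strptime(event_time[:13], "%Y-%m-%dT%H")
    # None mirrors the unmatched side of a LEFT OUTER JOIN.
    return dates_dim.get(hour)

found = date_dim_seq("2019-01-01T12:34:56.123456789")
missing = date_dim_seq("2020-06-30T08:00:00.000000000")
print(found, missing)
```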
The general schema pattern is that if there is a string date whose field name is:
x
..there is a 'xDim' struct equiv that immediately follows it in schema order below as described.
xDim.xTimestampUTC
xDim.xDateUTC
xDim.xTimestampMillis
xDim.xDateDimSeq
As mentioned with the snippet, although I am only showing 'eventOccurredTime' above, there are more of these throughout the schema, at lower levels too, that need the same transformation pattern applied.
Problem:
I have the Spark SQL (the full statement the snippet came from) to do this as a one-off for one event type. It is a large, explicit SQL statement that applies the time functions and joins I showed. Here is the problem I need help with.
I want to create a more generic, functionally oriented, reusable solution that traverses a nested dataframe and applies the transformation pattern described above 'where it needs to'.
How do I define 'where it needs to'?
Perhaps the naming convention is a good start: traverse the DF, look for any struct fields that follow the xDim ('Dim' suffix) pattern, use the preceding 'x' field as the input, and populate the xDim.* values in line with the naming pattern described.
How, in a function, do I best join to the registered datesDim table (it's static, remember) so that it performs well?
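For the naming-convention idea, here is a minimal sketch of the traversal in plain Python. The schema is modelled as a nested dict rather than a Spark StructType, and the "detail" / "createdTime" fields are invented to show a nested case. It only finds the (x, xDim) pairs. Generating the actual column expressions is left out.

```python
def find_dim_targets(schema, prefix=""):
    """Yield (source_path, dim_path) for each string field x with a sibling struct xDim."""
    for name, field_type in schema.items():
        path = prefix + name
        if isinstance(field_type, dict):
            # Recurse into nested structs, but skip the xDim structs themselves.
            is_dim_struct = name.endswith("Dim") and name[:-3] in schema
            if not is_dim_struct:
                yield from find_dim_targets(field_type, path + ".")
        elif field_type == "string" and isinstance(schema.get(name + "Dim"), dict):
            yield path, path + "Dim"

# Schema modelled as a nested dict: struct -> dict, leaf -> type name (an assumption).
schema = {
    "id": "string",
    "event": {
        "eventCategory": "string",
        "eventName": "string",
        "eventOccurredTime": "string",
        "eventOccurredTimeDim": {
            "eventOccurredTimestampUTC": "timestamp",
            "eventOccurredDateUTC": "date",
            "eventOccurredTimestampMillis": "long",
            "eventOccurredDateDimSeq": "long",
        },
        # Hypothetical deeper level to show the recursion.
        "detail": {
            "createdTime": "string",
            "createdTimeDim": {"createdTimestampUTC": "timestamp"},
        },
    },
}

targets = list(find_dim_targets(schema))
print(targets)
```

The resulting list of paths could then drive whatever builds the per-field expressions, so the 20 explicit repetitions collapse into one loop.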
Solution?
I think one or more UDFs are needed (we use Scala), either on their own or as fragments within SQL, but I'm not sure. Ensuring the datesDim lookup performs well is key, I think.
Or maybe there is another way?
Note: I am working with DataFrames / Spark SQL, not Datasets, though options for either are welcome.
Databricks
NOTE: I'm actually using the Databricks platform for this, so for those versed in SQL higher-order functions in Databricks:
https://docs.databricks.com/spark/latest/spark-sql/higher-order-functions-lambda-functions.html
....is there a slick option here using 'TRANSFORM' as a SQL HOF (might need to register a utility UDF and use this with transform perhaps)?
Thanks, Spark community, for your help! Sorry this is a long post setting the scene.
For DataFrame, it is easy to generate a new column with some operation using a udf with df.withColumn("newCol", myUDF("someCol")). To do something like this in Dataset, I guess I would be using the map function:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
You have to pass the entire case class T as input to the function. If the Dataset[T] has a lot of fields/columns, it would seem very inefficient to be passing the entire row if you just wanted to make one extra column by operating on one of the many columns of T. My question is, is Catalyst smart enough to be able to optimize this?
Is Catalyst smart enough to be able to optimize this?
tl;dr No. See SPARK-14083 Analyze JVM bytecode and turn closures into Catalyst expressions.
There's currently no way for Spark SQL's Catalyst Optimizer to know what you do in your Scala code.
Quoting SPARK-14083:
One big advantage of the Dataset API is the type safety, at the cost of performance due to heavy reliance on user-defined closures/lambdas. These closures are typically slower than expressions because we have more flexibility to optimize expressions (known data types, no virtual function calls, etc). In many cases, it's actually not going to be very difficult to look into the byte code of these closures and figure out what they are trying to do. If we can understand them, then we can turn them directly into Catalyst expressions for more optimized executions.
And there's even your case mentioned:
df.map(_.name) // equivalent to expression col("name")
As you can see it's still open and I doubt anyone works on this currently.
What you could do to help the Spark optimizer is to select that one column first and only then use the map operator with a one-argument UDF.
That would certainly match your requirements of not passing the entire JVM object to your function, but would not get rid of this slow deserialization from an internal row representation to your Scala object (that would land on the JVM and occupy some space until a GC happens).
I tried to figure this out myself since I could not find an answer anywhere.
Let's have a dataset which contains case classes with multiple fields:
scala> case class A(x: Int, y: Int)
scala> val dfA = spark.createDataset[A](Seq(A(1, 2)))
scala> val dfX = dfA.map(_.x)
Now if we check the optimized plan we get the following:
scala> val plan = dfX.queryExecution.optimizedPlan
SerializeFromObject [input[0, int, true] AS value#8]
+- MapElements <function1>, obj#7: int
+- DeserializeToObject newInstance(class A), obj#6: A
+- LocalRelation [x#2, y#3]
According to the more verbose plan.toJSON the DeserializeToObject step assumes both x and y to be present.
As proof, take for example the following snippet, which uses reflection instead of directly touching the fields of A and still works.
val dfX = dfA.map(
  // Name the row explicitly so it can be passed to invoke as the receiver.
  a => a.getClass.getMethods.find(_.getName == "x").get.invoke(a).asInstanceOf[Int]
)