I am using Spark SQL 2.4.
The following SQL throws an error. The query is big and has several steps, so I have given a concise version below. When I execute it from the spark-shell, it fails with the error shown below. The explain plan is rather long, so I have trimmed it to a more manageable extent:
I have checked that the values of the partition-by column encnbr are fairly unique. However, the Stages tab in the Spark UI shows only 1 very lengthy task, indicating SKEW. Since the keys are fairly unique, I'm not sure why this is happening. I have tried using cluster by encnbr, in vain.
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *(79) LocalLimit 4
+- *(79) Project [enc_key#976, prsn_key#951, prov_key#952, clm_key#977, clm_ln_key#978... 7 more fields]
+- Window [lag(non_keys#2862, 1, null) windowspecdefinition(encnbr#2722, eff_dt#713 ASC NULLS FIRST, data_timestamp#2723 ASC NULLS FIRST, specifiedwindowframe(Rowframe, -1, -1)) AS _we0#2868], [encnbr#2722], [eff_dt#713 ASC NULLS FIRST, data_timestamp#2723 ASC NULLS FIRST]......
The query consists of several steps, each depending on the result of the previous one. The step which is failing is similar to:
select
enc_key,
prsn_key,
prov_key,
clm_key,
clm_ln_key,
birth_dt,
case when lag(non_keys) over (partition by encnbr order by eff_dt asc, data_timestamp asc) is null
then 'Y'
when lag(non_keys) over (partition by encnbr order by eff_dt asc, data_timestamp asc) <> non_keys
then 'Y'
else 'N'
end as mod_flg
FROM (
select
enc_key,
encnbr,
prsn_key,
prov_key,
clm_key,
clm_ln_key,
birth_dt,
eff_dt,
data_timestamp,
md5(enc_key || prsn_key || prov_key || clm_key || clm_ln_key) as non_keys
from
table1
where encnbr is not null
union all
select
enc_key,
encnbr,
prsn_key,
prov_key,
clm_key,
clm_ln_key,
birth_dt,
eff_dt,
data_timestamp,
md5(enc_key || prsn_key || prov_key || clm_key || clm_ln_key) as non_keys
from
table2
where encnbr is not null
)
Can you please help me alleviate this issue? I have tried using cluster by encnbr in the previous step, but it still keeps failing.
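For reference, the cluster by attempt looked roughly like this (a sketch only, abbreviated to the table1 branch of the union, run from the spark-shell):
// Sketch: same columns as above, with cluster by added on the partition key.
spark.sql("""
  select enc_key, encnbr, prsn_key, prov_key, clm_key, clm_ln_key,
         birth_dt, eff_dt, data_timestamp,
         md5(enc_key || prsn_key || prov_key || clm_key || clm_ln_key) as non_keys
  from table1
  where encnbr is not null
  cluster by encnbr
""")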
Please help
Thanks.
Whenever I apply a CROSS JOIN to my Databricks SQL query, I get a message letting me know that a column does not exist, but I'm not sure if the issue is with the CROSS JOIN.
For example, the code should identify substrings such as http, https, ://, and /, remove them, and add a column called websiteurl without the aforementioned characters, i.e.
The code is as follows:
SELECT tt.homepage_url
,websiteurl = LEFT(v1.RightString,COALESCE(NULLIF(CHARINDEX('/',v1.RightString)-1,-1),150))
FROM basecrmcbreport.organizations tt
CROSS join (VALUES(SUBSTRING(homepage_url,CHARINDEX('//',homepage_url)+2,150)))v1(RightString)
However, the above returns the following:
Error in SQL statement: AnalysisException: Column 'homepage_url' does not exist. Did you mean one of the following? []; line 4 pos 31;
'Project ['tt.homepage_url, unresolvedalias(('websiteurl = 'LEFT('v1.RightString, 'COALESCE('NULLIF(('CHARINDEX(/, 'v1.RightString) - 1), -1), 150))), None)]
+- 'Join Cross
:- SubqueryAlias tt
: +- SubqueryAlias spark_catalog.basecrmcbreport.organizations
: +- Relation basecrmcbreport.organizations[uuid#2439,name#2440,type#2441,permalink#2442,cb_url#2443,rank#2444,created_at#2445,updated_at#2446,legal_name#2447,roles#2448,domain#2449,homepage_url#2450,country_code#2451,state_code#2452,region#2453,city#2454,address#2455,postal_code#2456,status#2457,short_description#2458,category_list#2459,category_groups_list#2460,num_funding_rounds#2461,total_funding_usd#2462,... 22 more fields] parquet
+- 'SubqueryAlias v1
+- 'UnresolvedSubqueryColumnAliases [RightString]
+- 'UnresolvedInlineTable [col1], [['SUBSTRING('homepage_url, ('CHARINDEX(//, 'homepage_url) + 2), 150)]]
Can someone let me know how to fix this?
Joins in general are used on tables, not on table-valued expressions; Spark SQL has no equivalent of T-SQL's CROSS APPLY or OUTER APPLY, so the VALUES list on the right side of the join cannot reference homepage_url from tt.
I used a sample table named demo for demonstration.
When I used the same query as yours, I got the same error:
%sql
SELECT tt.homepage_url
,websiteurl = LEFT(v1.RightString,COALESCE(NULLIF(CHARINDEX('/',v1.RightString)-1,-1),150))
FROM demo tt
CROSS join (VALUES(SUBSTRING(homepage_url,CHARINDEX('//',homepage_url)+2,150)))v1(RightString)
To resolve this, I used a SELECT statement to remove the https:// or http:// prefix, similar to your logic. Looking at the desired result, I used an inner join (instead of the cross join) with a LIKE operator as the join condition. The query is as follows:
%sql
SELECT tt.homepage_url
, LEFT(v1.RightString,COALESCE(NULLIF(CHARINDEX('/',v1.RightString)-1,-1),150)) as website_url
FROM demo tt
inner join (select (SUBSTRING(homepage_url,CHARINDEX('//',homepage_url)+2,150)) from demo)v1(RightString) where tt.homepage_url like concat('%',v1.RightString)
UPDATE for URLs containing a backslash (\):
SELECT tt.homepage_url
, LEFT(v1.RightString,COALESCE(NULLIF(CHARINDEX('/',replace(v1.RightString,'\\','/'))-1,-1),150)) as website_url
FROM demo tt
inner join (select (SUBSTRING(homepage_url,CHARINDEX('//',homepage_url)+2,150)) from demo)v1(RightString) on tt.homepage_url like concat('%',v1.RightString,'%') escape '|';
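As a side note, the same result can often be obtained without any join at all, by stripping the protocol and path directly. This is only a sketch using Spark's built-in regexp_replace and regexp_extract functions (written as a Scala cell against the same demo table; adjust the pattern to your data):
// Strip a leading http:// or https://, then keep everything up to the first '/'.
spark.sql("""
  SELECT homepage_url,
         regexp_extract(regexp_replace(homepage_url, '^https?://', ''), '^[^/]+', 0) AS websiteurl
  FROM demo
""").show(false)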
I'm getting an ambiguous column exception when joining on the id column of a dataframe, but there are no duplicate columns in the dataframe. What could be causing this error to be thrown?
Join operation, where a and input have been processed by other functions:
b = (
    input
    .where(F.col('st').like('%VALUE%'))
    .select('id', 'sii')
)
a.join(b, b['id'] == a['item'])
Dataframes:
(Pdb) a.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[item#25280L,sii#24665L]
(Pdb) b.explain()
== Physical Plan ==
*(1) Project [id#23711L, sii#24665L]
+- *(1) Filter (isnotnull(st#25022) AND st#25022 LIKE %VALUE%)
+- *(1) Scan ExistingRDD[id#23711L,st#25022,sii#24665L]
Exception:
pyspark.sql.utils.AnalysisException: Column id#23711L are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.;
If I recreate the dataframe using the same schema, I do not get any errors:
b_clean = spark_session.createDataFrame([], b.schema)
a.join(b_clean, b_clean['id'] == a['item'])
What can I look at to troubleshoot what happened with the original dataframes that would cause the ambiguous column error?
This error, together with the fact that your sii column has the same internal id in both tables (i.e. sii#24665L), tells us that both the a and b dataframes were derived from the same source. In essence, this makes your join a self-join (exactly what the error message says). In such cases it's recommended to use aliases for the dataframes. Try this:
a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
Note that in some cases you may not be able to save your result, as the resulting dataframe will have two sii columns. I would recommend explicitly selecting only the columns that you need. Renaming columns using alias may also help if you decide you need both of the duplicate columns, e.g.:
df = (
    a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
    .select('item',
            'id',
            F.col('a.sii').alias('a_sii'))
)
I'm using pyspark to perform a join of two tables with a relatively complex join condition (using greater than/smaller than in the join conditions). This works fine, but breaks down as soon as I add a fillna command before the join.
The code looks something like this:
join_cond = [
    df_a.col1 == df_b.colx,
    df_a.col2 == df_b.coly,
    df_a.col3 >= df_b.colz
]
df = (
    df_a
    .fillna('NA', subset=['col1'])
    .join(df_b, join_cond, 'left')
)
This results in an error like this:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) col1#4765 missing from col1#6488,col2#4766,col3#4768,colx#4823,coly#4830,colz#4764 in operator !Join LeftOuter, (((col1#4765 = colx#4823) && (col2#4766 = coly#4830)) && (col3#4768 >= colz#4764)). Attribute(s) with the same name appear in the operation: col1. Please check if the right attribute(s) are used.
It looks like Spark no longer recognizes col1 after performing the fillna. (The error does not come up if I comment that line out.) The problem is that I do need that statement. (And in general I've simplified this example a lot.)
I've looked at this question, but these answers do not work for me. Specifically, using .alias('a') after the fillna doesn't work because then spark does not recognize the a in the join condition.
Could someone:
Explain exactly why this is happening and how I can avoid it in the future?
Advise me on a way to solve it?
Thanks in advance for your help.
What is happening?
In order to "replace" empty values, a new dataframe is created that contains new columns. These new columns have the same names as the old ones, but they are effectively completely new Spark objects. In the Scala source you can see that the "changed" columns are newly created ones, while the original columns are dropped.
A way to see this effect is to call explain on the dataframe before and after replacing the empty values:
df_a.explain()
prints
== Physical Plan ==
*(1) Project [_1#0L AS col1#6L, _2#1L AS col2#7L, _3#2L AS col3#8L]
+- *(1) Scan ExistingRDD[_1#0L,_2#1L,_3#2L]
while
df_a.fillna(42, subset=['col1']).explain()
prints
== Physical Plan ==
*(1) Project [coalesce(_1#0L, 42) AS col1#27L, _2#1L AS col2#7L, _3#2L AS col3#8L]
+- *(1) Scan ExistingRDD[_1#0L,_2#1L,_3#2L]
Both plans contain a column called col1, but in the first case the internal representation is called col1#6L, while in the second it is called col1#27L.
When the join condition df_a.col1 == df_b.colx is bound to the column col1#6L, the join will fail, because only the column col1#27L is part of the left table.
How can the problem be solved?
The obvious way would be to move the `fillna` operation before the definition of the join condition:
df_a = df_a.fillna('NA', subset=['col1'])
join_cond = [
    df_a.col1 == df_b.colx,
    [...]
If this is not possible or not wanted, you can change the join condition. Instead of using a column from the dataframe (df_a.col1), you can use a column that is not associated with any dataframe, by using the col function. This column works only based on its name and therefore does not care that the column object was replaced in the dataframe:
from pyspark.sql import functions as F
join_cond = [
    F.col("col1") == df_b.colx,
    df_a.col2 == df_b.coly,
    df_a.col3 >= df_b.colz
]
The downside of this second approach is that the column names in both tables must be unique.
I am reading Structured Streaming documentation.
On the one hand, if I get it right, under Policy for handling multiple watermarks they say that if you have different watermarks on two streams, Spark will use either the minimum value (by default) or the maximum value (if you specify it explicitly) as a global watermark, and ignore the other one.
On the other hand, under Inner Joins with optional Watermarking they have an example of two streams with different watermarks, and they say that for each stream the specified watermark will be used (rather than just the minimum or the maximum one as a global watermark for both).
Perhaps I don't understand what they really try to explain under Policy for handling multiple watermarks, because they say that if you set multipleWatermarkPolicy to max then the global watermark moves at the pace of the fastest stream, but it should be the complete opposite, because a bigger watermark means that the stream is slower.
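For reference, the join example I am referring to looks roughly like this (paraphrased from the guide; impressions and clicks are the guide's two streaming DataFrames):
import org.apache.spark.sql.functions.expr

// Two streams with different watermark delays, joined on an event-time range.
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))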
If I understand correctly, you would like to know how multiple watermarks behave for join operations, right? If so, I did some digging into the implementation to find the answer.
multipleWatermarkPolicy configuration used globally
The spark.sql.streaming.multipleWatermarkPolicy property is used globally for all operations involving multiple watermarks, and its default is min. You can figure this out by looking at the WatermarkTracker#updateWatermark(executedPlan: SparkPlan) method called by MicroBatchExecution#runBatch. And runBatch is invoked by org.apache.spark.sql.execution.streaming.StreamExecution#runStream, which is the class responsible for... stream execution ;)
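For reference, it is a regular session configuration; as far as I know it has to be set before the streaming query is started, e.g.:
// Switch the global watermark policy from the default "min" to "max"
// (set this before starting the streaming query).
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")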
updateWatermark implementation
updateWatermark starts by collecting all event-time watermark nodes from physical plan:
val watermarkOperators = executedPlan.collect {
  case e: EventTimeWatermarkExec => e
}
if (watermarkOperators.isEmpty) return

watermarkOperators.zipWithIndex.foreach {
  case (e, index) if e.eventTimeStats.value.count > 0 =>
    logDebug(s"Observed event time stats $index: ${e.eventTimeStats.value}")
    val newWatermarkMs = e.eventTimeStats.value.max - e.delayMs
    val prevWatermarkMs = operatorToWatermarkMap.get(index)
    if (prevWatermarkMs.isEmpty || newWatermarkMs > prevWatermarkMs.get) {
      operatorToWatermarkMap.put(index, newWatermarkMs)
    }

  // Populate 0 if we haven't seen any data yet for this watermark node.
  case (_, index) =>
    if (!operatorToWatermarkMap.isDefinedAt(index)) {
      operatorToWatermarkMap.put(index, 0)
    }
}
To get an idea, a physical plan for a stream-to-stream join could look like this:
== Physical Plan ==
WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter#6a1dff1d
+- StreamingSymmetricHashJoin [mainKey#10730], [joinedKey#10733], Inner, condition = [ leftOnly = null, rightOnly = null, both = (mainEventTimeWatermark#10732-T4000ms >= joinedEventTimeWatermark#10735-T8000ms), full = (mainEventTimeWatermark#10732-T4000ms >= joinedEventTimeWatermark#10735-T8000ms) ], state info [ checkpoint = file:/tmp/temporary-3416be37-81b4-471a-b2ca-9b8f8593843a/state, runId = 17a4e028-29cb-41b0-b34b-44e20409b335, opId = 0, ver = 13, numPartitions = 200], 389000, state cleanup [ left value predicate: (mainEventTimeWatermark#10732-T4000ms <= 388999000), right = null ]
:- Exchange hashpartitioning(mainKey#10730, 200)
: +- *(2) Filter isnotnull(mainEventTimeWatermark#10732-T4000ms)
: +- EventTimeWatermark mainEventTimeWatermark#10732: timestamp, interval 4 seconds
: +- *(1) Filter isnotnull(mainKey#10730)
: +- *(1) Project [mainKey#10730, mainEventTime#10731L, mainEventTimeWatermark#10732]
: +- *(1) ScanV2 MemoryStreamDataSource$[mainKey#10730, mainEventTime#10731L, mainEventTimeWatermark#10732]
+- Exchange hashpartitioning(joinedKey#10733, 200)
+- *(4) Filter isnotnull(joinedEventTimeWatermark#10735-T8000ms)
+- EventTimeWatermark joinedEventTimeWatermark#10735: timestamp, interval 8 seconds
+- *(3) Filter isnotnull(joinedKey#10733)
+- *(3) Project [joinedKey#10733, joinedEventTime#10734L, joinedEventTimeWatermark#10735]
+- *(3) ScanV2 MemoryStreamDataSource$[joinedKey#10733, joinedEventTime#10734L, joinedEventTimeWatermark#10735]
Later, updateWatermark uses one of the available watermark policies, MinWatermark or MaxWatermark, depending on the value you set in spark.sql.streaming.multipleWatermarkPolicy. It's resolved that way in the MultipleWatermarkPolicy companion object:
def apply(policyName: String): MultipleWatermarkPolicy = {
  policyName.toLowerCase match {
    case DEFAULT_POLICY_NAME => MinWatermark
    case "max" => MaxWatermark
    case _ =>
      throw new IllegalArgumentException(s"Could not recognize watermark policy '$policyName'")
  }
}
updateWatermark uses the resolved policy to compute the watermark to apply on the query:
// Update the global watermark to the minimum of all watermark nodes.
// This is the safest option, because only the global watermark is fault-tolerant. Making
// it the minimum of all individual watermarks guarantees it will never advance past where
// any individual watermark operator would be if it were in a plan by itself.
val chosenGlobalWatermark = policy.chooseGlobalWatermark(operatorToWatermarkMap.values.toSeq)
if (chosenGlobalWatermark > globalWatermarkMs) {
  logInfo(s"Updating event-time watermark from $globalWatermarkMs to $chosenGlobalWatermark ms")
  globalWatermarkMs = chosenGlobalWatermark
} else {
  logDebug(s"Event time watermark didn't move: $chosenGlobalWatermark < $globalWatermarkMs")
}
Misc
However, I agree that the comment in the previous snippet is a little misleading, since it speaks of "Update the global watermark to the minimum of all watermark nodes." (https://github.com/apache/spark/blob/v2.4.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/WatermarkTracker.scala#L109)
The behavior with multiple watermarks is also asserted in EventTimeWatermarkSuite. Even though it applies to a UNION, you saw in the first two parts that the watermark is updated the same way for all combination operations.
To debug on your own, check the following entries in the logs:
[2019-07-05 08:30:09,729] org.apache.spark.internal.Logging$class INFO Streaming query made progress - returns all information about each executed query. In its eventTime part you will find the watermark property, which should be different if you execute the same query with the min and the max multipleWatermarkPolicy.
[2019-07-05 08:30:35,685] org.apache.spark.internal.Logging$class INFO Updating event-time watermark from 0 to 6000 ms (org.apache.spark.sql.execution.streaming.WatermarkTracker:54) - says that the watermark has just changed. As before, it should differ according to the min/max property.
So to wrap up: starting from 2.4.0 we can choose one watermark policy (min or max). Prior to 2.4.0, the min watermark was the hard-coded default (SPARK-24730). And this is independent of the operation type (inner join, outer join, ...), because the watermark resolution method is the same for all queries.
I have a connected graph like this:
user1|A,C,B
user2|A,E,B,A
user3|C,B,A,B,E
user4|A,C,B,E,B
where each line gives a user name and the path that user followed. For example, for
user1 the path is A->C->B
user2: A->E->B->A
user3: C->B->A->B->E
user4: A->C->B->E->B
Now, I want to find all users who have reached from A to E. The output should be user2, user3, user4 (since all these users eventually reached E from A, no matter how many hops they took). How can I write the motif for this?
This is what I tried.
val vertices=spark.createDataFrame(List(("A","Billing"),("B","Devices"),("C","Payment"),("D","Data"),("E","Help"))).toDF("id","desc")
val edges = spark.createDataFrame(List(("A","C","user1"),
("C","B","user1"),
("A","E","user2"),
("E","B","user2"),
("B","A","user2"),
("C","B","user3"),
("B","A","user3"),
("A","B","user3"),
("B","E","user3"),
("A","C","user4"),
("C","B","user4"),
("B","E","user4"),
("E","B","user4"))).toDF("src","dst","user")
val pathAnalysis=GraphFrame(vertices,edges)
pathAnalysis.find("(a)-[]->();()-[]->();()-[]->(d)").filter("a.id='A'").filter("d.id='E'").distinct().show()
But I am getting an exception like this
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Join Inner
:- Project [a#355]
: +- Join Inner, (__tmp-4363599943432734077#353.src = a#355.id)
: :- LocalRelation [__tmp-4363599943432734077#353]
: +- Project [named_struct(id, _1#0, desc, _2#1) AS a#355]
: +- Filter (named_struct(id, _1#0, desc, _2#1).id = A)
: +- LocalRelation [_1#0, _2#1]
+- LocalRelation
and
LocalRelation [__tmp-1043886091038848698#371]
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
I am not sure if my condition is correct, or how to set the property spark.sql.crossJoin.enabled=true in a spark-shell.
I invoked my spark-shell as follows:
spark-shell --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11
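(For what it's worth, from the docs I believe the property can be set either at launch, via --conf spark.sql.crossJoin.enabled=true, or at runtime as below, though I'm not sure it is the right fix:)
// inside the shell, before running the motif query:
spark.conf.set("spark.sql.crossJoin.enabled", "true")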
My suggested solution is kinda trivial, but it will work fine if the paths are relatively short and the number of users (i.e., the number of rows in the data set) is big. If this is not the case, please let me know; other implementations are possible.
case class UserPath(
  userId: String,
  path: List[String])

import spark.implicits._ // for .toDF and the Dataset encoders

val dsUsers = Seq(
    UserPath("user1", List("A", "B", "C")),
    UserPath("user2", List("A", "E", "B", "A")))
  .toDF().as[UserPath]

// Returns the user id if the path contains an "E" somewhere after the first "A".
def pathExists(up: UserPath): Option[String] = {
  val prefix = up.path.takeWhile(s => s != "A")
  val len = up.path.length
  if (up.path.takeRight(len - prefix.length).contains("E"))
    Some(up.userId)
  else
    None
}
// Users with path from A -> E.
dsUsers.map(pathExists).filter(opt => !opt.isEmpty)
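If you want just the user ids rather than Options, a slightly tidier variant (same logic, a sketch) is:
// flatMap drops the Nones and unwraps the Somes in one step.
dsUsers.flatMap(pathExists(_).toList).show()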
You can also use the BFS algorithm for it: http://graphframes.github.io/graphframes/docs/_site/api/scala/index.html#org.graphframes.lib.BFS
With your data model you will have to iterate over the users and run the BFS for each of them, like this:
scala> pathAnalysis.bfs.fromExpr($"id" === "A").toExpr($"id" === "E").edgeFilter($"user" === "user3").run().show
+------------+-------------+------------+-------------+---------+
| from| e0| v1| e1| to|
+------------+-------------+------------+-------------+---------+
|[A, Billing]|[A, B, user3]|[B, Devices]|[B, E, user3]|[E, Help]|
+------------+-------------+------------+-------------+---------+
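To cover all users, the loop could look roughly like this (a sketch; it runs one BFS per user from the driver, so it is only reasonable for a small number of users):
import org.apache.spark.sql.functions.col

// Collect the distinct user ids, then keep those for which a path A -> E exists.
val userIds = pathAnalysis.edges.select("user").distinct().as[String].collect()
val reachedAtoE = userIds.filter { u =>
  pathAnalysis.bfs
    .fromExpr(col("id") === "A")
    .toExpr(col("id") === "E")
    .edgeFilter(col("user") === u)
    .run()
    .head(1)
    .nonEmpty
}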