Databricks CROSS JOIN failing to identify column

Whenever I apply a CROSS JOIN to my Databricks SQL query, I get a message letting me know that a column does not exist, but I'm not sure if the issue is with the CROSS JOIN itself.
For example, the code should identify substrings such as http, https, ://, and /, remove them, and add a column called websiteurl without the aforementioned characters.
The code is as follows:
SELECT tt.homepage_url
,websiteurl = LEFT(v1.RightString,COALESCE(NULLIF(CHARINDEX('/',v1.RightString)-1,-1),150))
FROM basecrmcbreport.organizations tt
CROSS join (VALUES(SUBSTRING(homepage_url,CHARINDEX('//',homepage_url)+2,150)))v1(RightString)
However, the above returns the following:
Error in SQL statement: AnalysisException: Column 'homepage_url' does not exist. Did you mean one of the following? []; line 4 pos 31;
'Project ['tt.homepage_url, unresolvedalias(('websiteurl = 'LEFT('v1.RightString, 'COALESCE('NULLIF(('CHARINDEX(/, 'v1.RightString) - 1), -1), 150))), None)]
+- 'Join Cross
:- SubqueryAlias tt
: +- SubqueryAlias spark_catalog.basecrmcbreport.organizations
: +- Relation basecrmcbreport.organizations[uuid#2439,name#2440,type#2441,permalink#2442,cb_url#2443,rank#2444,created_at#2445,updated_at#2446,legal_name#2447,roles#2448,domain#2449,homepage_url#2450,country_code#2451,state_code#2452,region#2453,city#2454,address#2455,postal_code#2456,status#2457,short_description#2458,category_list#2459,category_groups_list#2460,num_funding_rounds#2461,total_funding_usd#2462,... 22 more fields] parquet
+- 'SubqueryAlias v1
+- 'UnresolvedSubqueryColumnAliases [RightString]
+- 'UnresolvedInlineTable [col1], [['SUBSTRING('homepage_url, ('CHARINDEX(//, 'homepage_url) + 2), 150)]]
Can someone let me know how to fix this?

Joins in general are used on tables, not on table-valued functions the way CROSS APPLY or OUTER APPLY are.
The following is the sample data that I have used for demonstration.
When I used the same query as yours, I got the same error.
%sql
SELECT tt.homepage_url
,websiteurl = LEFT(v1.RightString,COALESCE(NULLIF(CHARINDEX('/',v1.RightString)-1,-1),150))
FROM demo tt
CROSS join (VALUES(SUBSTRING(homepage_url,CHARINDEX('//',homepage_url)+2,150)))v1(RightString)
To resolve this, I have used a select statement to remove the https:// or http://, similar to your logic. Looking at the desired result, I have used an inner join (instead of a cross join) with a LIKE operator defining the condition. The query is as follows:
%sql
SELECT tt.homepage_url
, LEFT(v1.RightString,COALESCE(NULLIF(CHARINDEX('/',v1.RightString)-1,-1),150)) as website_url
FROM demo tt
inner join (select (SUBSTRING(homepage_url,CHARINDEX('//',homepage_url)+2,150)) from demo)v1(RightString) where tt.homepage_url like concat('%',v1.RightString)
UPDATE for URLs containing a backslash (\):
SELECT tt.homepage_url
, LEFT(v1.RightString,COALESCE(NULLIF(CHARINDEX('/',replace(v1.RightString,'\\','/'))-1,-1),150)) as website_url
FROM demo tt
inner join (select (SUBSTRING(homepage_url,CHARINDEX('//',homepage_url)+2,150)) from demo)v1(RightString) on tt.homepage_url like concat('%',v1.RightString,'%') escape '|';
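
For what it's worth, the same protocol-and-path stripping can also be done without any join at all, using built-in string functions. A minimal PySpark sketch, assuming the basecrmcbreport.organizations table from the question (parse_url and regexp_replace are standard Spark functions):

from pyspark.sql import functions as F

df = spark.table("basecrmcbreport.organizations")  # table name taken from the question

cleaned = df.select(
    "homepage_url",
    # parse_url extracts the host, i.e. everything between '://' and the next '/'
    F.expr("parse_url(homepage_url, 'HOST')").alias("websiteurl"),
    # alternatively, strip the protocol and then anything after the first remaining '/'
    F.regexp_replace(
        F.regexp_replace("homepage_url", r"^https?://", ""),
        r"/.*$",
        "",
    ).alias("websiteurl_regex"),
)
cleaned.show(truncate=False)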

Related

Databricks AnalysisException: Column 'l' does not exist

I have a very strange occurrence with my code.
I keep on getting the error
AnalysisException: Column 'homepage_url' does not exist
However, when I do a select without the cross Joins, the column does actually exist.
Can someone take a look at my cross Joins and let me know if that is where the problem is?
SELECT DISTINCT
account.xpd_relationshipstatus AS CRM_xpd_relationshipstatus
,REPLACE(owneridname,'Data.Import #','') AS MontaguOwner
,account.ts_montaguoffice AS Montagu_Office
,CAST(account.ts_reminderdatesetto AS DATE) AS CRM_ts_reminderdatesetto
,CAST(account.ts_lastdatestatuschanged AS DATE) AS YearofCRMtslastdatestatuschanged
,organizations.name AS nameCB
,organizations.homepage_url
,iff(e like 'www.%', e, 'www.' + e) AS website
,left(category_list,charindex(',',category_list +',' )-1) AS category_CB
-- ,case when charindex(',',category_list,0) > 0 then left(category_list,charindex(',',category_list)-1) else category_list end as category_CB
,organizations.category_groups_list AS category_groups_CB
FROM basecrmcbreport.account
LEFT OUTER JOIN basecrmcbreport.CRM2CBURL_Lookup
ON account.Id = CRM2CBURL_Lookup.Key
LEFT OUTER JOIN basecrmcbreport.organizations
ON CRM2CBURL_Lookup.CB_URL_KEY = organizations.cb_url
cross Join (values (charindex('://', homepage_url))) a(a)
cross Join (values (iff(a = 0, 1, a + 3))) b(b)
cross Join (values (charindex('/', homepage_url, b))) c(c)
cross Join (values (iff(c = 0, length(homepage_url) + 1, c))) d(d)
cross Join (values (substring(homepage_url, b, d - b))) e(e)
Without the cross Joins, the query works fine.
The main reason a cross join (or any join) recognizes the column when you join on a select, but not when you join on a table-valued function, is that joins are used on tables only.
To use table-valued functions, one must use CROSS APPLY or OUTER APPLY, but these are not supported in Databricks SQL.
The following is the demo data I am using:
I tried using inner join on a table valued function using the following query and got the same error:
select d1.*, a from demo1 d1 inner join (values(if(d1.team = 'OG', 2, 1))) a;
Instead, when the join is done against a select query, it works, since that is how joins are meant to function:
select d1.*,a.no_of_wins from demo1 d1 inner join (select id,case team when 'OG' then 2 when 'TS' then 1 end as no_of_wins from demo1) a on d1.id=a.id;
So, the remedy for this problem is to replace all the table valued functions on which you are using joins with SELECT statements.
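
As a concrete illustration of that remedy, the chained VALUES cross joins from the question can be collapsed into plain column expressions, so no join is needed at all. A sketch using the DataFrame API, with substring_index standing in for the charindex/substring arithmetic (table name taken from the question):

from pyspark.sql import functions as F

df = spark.table("basecrmcbreport.organizations")

# everything after '://' (or the whole string if there is no protocol),
# then everything before the first '/' of what remains
host = F.substring_index(
    F.substring_index(F.col("homepage_url"), "://", -1), "/", 1
)

result = df.select(
    "homepage_url",
    F.when(host.startswith("www."), host)
     .otherwise(F.concat(F.lit("www."), host))
     .alias("website"),
)
result.show(truncate=False)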

pyspark.sql.utils.AnalysisException: Column ambiguous but no duplicate column names

I'm getting an ambiguous column exception when joining on the id column of a dataframe, but there are no duplicate columns in the dataframe. What could be causing this error to be thrown?
Join operation, where a and input have been processed by other functions:
b = (
    input
    .where(F.col('st').like('%VALUE%'))
    .select('id', 'sii')
)
a.join(b, b['id'] == a['item'])
Dataframes:
(Pdb) a.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[item#25280L,sii#24665L]
(Pdb) b.explain()
== Physical Plan ==
*(1) Project [id#23711L, sii#24665L]
+- *(1) Filter (isnotnull(st#25022) AND st#25022 LIKE %VALUE%)
+- *(1) Scan ExistingRDD[id#23711L,st#25022,sii#24665L]
Exception:
pyspark.sql.utils.AnalysisException: Column id#23711L are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.;
If I recreate the dataframe using the same schema, I do not get any errors:
b_clean = spark_session.createDataFrame([], b.schema)
a.join(b_clean, b_clean['id'] == a['item'])
What can I look at to troubleshoot what happened with the original dataframes that would cause the ambiguous column error?
This error, together with the fact that your sii column has the same internal id in both tables (i.e. sii#24665L), tells us that both the a and b dataframes are made from the same source. So, in essence, this makes your join a self join (exactly what the error message says). In such cases it's recommended to use aliases for the dataframes. Try this:
a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
Again, in some systems you may not be able to save your result, as the resulting dataframe will have two sii columns. I would recommend explicitly selecting only the columns that you need. Renaming columns using alias may also help if you decide that you need both of the duplicate columns. E.g.:
df = (
    a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
    .select(
        'item',
        'id',
        F.col('a.sii').alias('a_sii')
    )
)
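
As the error message itself points out, the self-join check can also be switched off, although aliasing as shown above is the cleaner fix. A one-line sketch, to be used with care since it only silences the check:

# disables the ambiguous self-join check mentioned at the end of the error message
spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "false")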

Error: Resolved attributes missing in join

I'm using pyspark to perform a join of two tables with a relatively complex join condition (using greater than/smaller than in the join conditions). This works fine, but breaks down as soon as I add a fillna command before the join.
The code looks something like this:
join_cond = [
    df_a.col1 == df_b.colx,
    df_a.col2 == df_b.coly,
    df_a.col3 >= df_b.colz
]
df = (
    df_a
    .fillna('NA', subset=['col1'])
    .join(df_b, join_cond, 'left')
)
This results in an error like this:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) col1#4765 missing from col1#6488,col2#4766,col3#4768,colx#4823,coly#4830,colz#4764 in operator !Join LeftOuter, (((col1#4765 = colx#4823) && (col2#4766 = coly#4830)) && (col3#4768 >= colz#4764)). Attribute(s) with the same name appear in the operation: col1. Please check if the right attribute(s) are used.
It looks like spark no longer recognizes col1 after performing the fillna. (The error does not come up if I comment that out.) The problem is that I do need that statement. (And in general I've simplified this example a lot.)
I've looked at this question, but these answers do not work for me. Specifically, using .alias('a') after the fillna doesn't work because then spark does not recognize the a in the join condition.
Could someone:
Explain exactly why this is happening and how I can avoid it in the future?
Advise me on a way to solve it?
Thanks in advance for your help.
What is happening?
In order to "replace" empty values, a new dataframe is created that contains new columns. These new columns have the same names as the old ones but are effectively completely new Spark objects. In the Scala code you can see that the "changed" columns are newly created, while the original columns are dropped.
A way to see this effect is to call explain on the dataframe before and after replacing the empty values:
df_a.explain()
prints
== Physical Plan ==
*(1) Project [_1#0L AS col1#6L, _2#1L AS col2#7L, _3#2L AS col3#8L]
+- *(1) Scan ExistingRDD[_1#0L,_2#1L,_3#2L]
while
df_a.fillna(42, subset=['col1']).explain()
prints
== Physical Plan ==
*(1) Project [coalesce(_1#0L, 42) AS col1#27L, _2#1L AS col2#7L, _3#2L AS col3#8L]
+- *(1) Scan ExistingRDD[_1#0L,_2#1L,_3#2L]
Both plans contain a column called col1, but in the first case the internal representation is called col1#6L while the second one is called col1#27L.
Since the join condition df_a.col1 == df_b.colx is associated with the column col1#6L, the join fails when only the column col1#27L is part of the left table.
How can the problem be solved?
The obvious way would be to move the `fillna` operation before the definition of the join condition:
df_a = df_a.fillna('NA', subset=['col1'])
join_cond = [
df_a.col1 == df_b.colx,
[...]
If this is not possible or not wanted, you can change the join condition. Instead of using a column from the dataframe (df_a.col1), you can use a column that is not associated with any dataframe, via the col function. This column is resolved purely by its name and is therefore unaffected when the column is replaced in the dataframe:
from pyspark.sql import functions as F
join_cond = [
    F.col("col1") == df_b.colx,
    df_a.col2 == df_b.coly,
    df_a.col3 >= df_b.colz
]
The downside of this second approach is that the column names in both tables must be unique.
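
A minimal, self-contained sketch of both fixes side by side (the toy data and values here are made up purely for illustration):

from pyspark.sql import functions as F

df_a = spark.createDataFrame([("x", 1, 10), (None, 2, 20)], ["col1", "col2", "col3"])
df_b = spark.createDataFrame([("x", 1, 5), ("NA", 2, 15)], ["colx", "coly", "colz"])

# Fix 1: run fillna first, then build the join condition from the new dataframe
df_a_filled = df_a.fillna("NA", subset=["col1"])
cond = [
    df_a_filled.col1 == df_b.colx,
    df_a_filled.col2 == df_b.coly,
    df_a_filled.col3 >= df_b.colz,
]
df_a_filled.join(df_b, cond, "left").show()

# Fix 2: refer to the replaced column by name only, so it resolves against
# whichever 'col1' ends up on the left side of the join
cond_by_name = [
    F.col("col1") == df_b.colx,
    df_a.col2 == df_b.coly,
    df_a.col3 >= df_b.colz,
]
df_a.fillna("NA", subset=["col1"]).join(df_b, cond_by_name, "left").show()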

Exchange SinglePartition Error in SparkSQL Query

I am using Spark SQL 2.4 queries.
I am using the following SQL, which is throwing an error. The query is big and has several steps, so I have given a concise version below. When I execute the query from the spark-shell, it fails with the error given below. The explain plan is rather long, so I have trimmed it to a more manageable extent.
I have checked that the values of the encnbr column used in the partition by are fairly unique. However, the Stages tab in the Spark UI shows only one very lengthy task, which indicates skew. Since the keys are unique, I'm not sure why this is happening. I have tried using cluster by encnbr, in vain.
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange SinglePartition
+- *(79) LocalLimit 4
+- *(79) Project [enc_key#976, prsn_key#951, prov_key#952, clm_key#977, clm_ln_key#978... 7 more fields]
+- Window [lag(non_keys#2862, 1, null) windowspecdefinition(encnbr#2722, eff_dt#713 ASC NULLS FIRST, data_timestamp#2723 ASC NULLS FIRST, specifiedwindowframe(Rowframe, -1, -1)) AS _we0#2868], [encnbr#2722], [eff_dt#713 ASC NULLS FIRST, data_timestamp#2723 ASC NULLS FIRST]......
The query consists of several steps, each depending on the result of the previous one. The step that is failing is similar to:
select
    enc_key,
    prsn_key,
    prov_key,
    clm_key,
    clm_ln_key,
    birth_dt,
    case when lag(non_keys) over (partition by encnbr order by eff_dt asc, data_timestamp asc) is null
             then 'Y'
         when lag(non_keys) over (partition by encnbr order by eff_dt asc, data_timestamp asc) <> non_keys
             then 'Y'
         else 'N'
    end as mod_flg
FROM (
    select
        enc_key,
        encnbr,
        prsn_key,
        prov_key,
        clm_key,
        clm_ln_key,
        birth_dt,
        eff_dt,
        data_timestamp,
        md5(enc_key || prsn_key || prov_key || clm_key || clm_ln_key) as non_keys
    from table1
    where encnbr is not null
    union all
    select
        enc_key,
        encnbr,
        prsn_key,
        prov_key,
        clm_key,
        clm_ln_key,
        birth_dt,
        eff_dt,
        data_timestamp,
        md5(enc_key || prsn_key || prov_key || clm_key || clm_ln_key) as non_keys
    from table2
    where encnbr is not null
)
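
For reference, the same change-detection step can be written with the DataFrame API, which makes the window spec explicit (a sketch; unioned here stands for the UNION ALL shown above):

from pyspark.sql import Window, functions as F

w = Window.partitionBy("encnbr").orderBy(F.col("eff_dt").asc(), F.col("data_timestamp").asc())

flagged = (
    unioned  # the UNION ALL of table1 and table2 above
    .withColumn("non_keys", F.md5(F.concat("enc_key", "prsn_key", "prov_key", "clm_key", "clm_ln_key")))
    .withColumn("prev_non_keys", F.lag("non_keys", 1).over(w))
    .withColumn(
        "mod_flg",
        F.when(F.col("prev_non_keys").isNull(), "Y")
         .when(F.col("prev_non_keys") != F.col("non_keys"), "Y")
         .otherwise("N"),
    )
)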
Can you please help me alleviate this issue? I have tried using cluster by encnbr in the previous step, but it still keeps failing.
Thanks.

Finding path between 2 vertices which are not directly connected

I have a connected graph like this:
user1|A,C,B
user2|A,E,B,A
user3|C,B,A,B,E
user4|A,C,B,E,B
where user is the property name, followed by the path that particular user took. For example, for
user1 the path is A->C->B
user2: A->E->B->A
user3: C->B->A->B->E
user4: A->C->B->E->B
Now, I want to find all users who have reached E from A. The output should be
user2, user3, user4 (since all these users finally reached E from A, no matter how many hops they took). How can I write the motif for this?
This is what I tried.
val vertices=spark.createDataFrame(List(("A","Billing"),("B","Devices"),("C","Payment"),("D","Data"),("E","Help"))).toDF("id","desc")
val edges = spark.createDataFrame(List(("A","C","user1"),
("C","B","user1"),
("A","E","user2"),
("E","B","user2"),
("B","A","user2"),
("C","B","user3"),
("B","A","user3"),
("A","B","user3"),
("B","E","user3"),
("A","C","user4"),
("C","B","user4"),
("B","E","user4"),
("E","B","user4"))).toDF("src","dst","user")
val pathAnalysis=GraphFrame(vertices,edges)
pathAnalysis.find("(a)-[]->();()-[]->();()-[]->(d)").filter("a.id='A'").filter("d.id='E'").distinct().show()
But I am getting an exception like this:
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Join Inner
:- Project [a#355]
: +- Join Inner, (__tmp-4363599943432734077#353.src = a#355.id)
: :- LocalRelation [__tmp-4363599943432734077#353]
: +- Project [named_struct(id, _1#0, desc, _2#1) AS a#355]
: +- Filter (named_struct(id, _1#0, desc, _2#1).id = A)
: +- LocalRelation [_1#0, _2#1]
+- LocalRelation
and
LocalRelation [__tmp-1043886091038848698#371]
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
I am not sure if my condition is correct, or how to set the property
spark.sql.crossJoin.enabled=true in a spark-shell.
I invoked my spark-shell as follows
spark-shell --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11
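As a side note on the configuration part of the question: spark.sql.crossJoin.enabled can either be passed when launching the shell or set from inside it at runtime, e.g.

spark-shell --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11 --conf spark.sql.crossJoin.enabled=true

or, inside the running shell:

spark.conf.set("spark.sql.crossJoin.enabled", "true")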
My suggested solution is kinda trivial, but it will work fine if the paths are relatively short and the number of users (i.e., the number of rows in the data set) is big. If this is not the case, please let me know; other implementations are possible.
case class UserPath(
  userId: String,
  path: List[String])

val dsUsers = Seq(
    UserPath("user1", List("A", "B", "C")),
    UserPath("user2", List("A", "E", "B", "A")))
  .toDF.as[UserPath]

def pathExists(up: UserPath): Option[String] = {
  // drop everything before the first "A"; if an "E" occurs afterwards, the user reached E from A
  val prefix = up.path.takeWhile(s => s != "A")
  val len = up.path.length
  if (up.path.takeRight(len - prefix.length).contains("E"))
    Some(up.userId)
  else
    None
}

// Users with a path from A -> E.
dsUsers.map(pathExists).filter(opt => !opt.isEmpty)
You can also use the BFS algorithm for it: http://graphframes.github.io/graphframes/docs/_site/api/scala/index.html#org.graphframes.lib.BFS
With your data model you will have to iterate over the users and run BFS for each of them, like this:
scala> pathAnalysis.bfs.fromExpr($"id" === "A").toExpr($"id" === "E").edgeFilter($"user" === "user3").run().show
+------------+-------------+------------+-------------+---------+
| from| e0| v1| e1| to|
+------------+-------------+------------+-------------+---------+
|[A, Billing]|[A, B, user3]|[B, Devices]|[B, E, user3]|[E, Help]|
+------------+-------------+------------+-------------+---------+
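
If the graph is built in PySpark instead, the same per-user iteration can be sketched like this (assuming vertices and edges DataFrames equivalent to the Scala ones above, with the graphframes package attached):

from graphframes import GraphFrame

g = GraphFrame(vertices, edges)  # same vertices/edges as above, as PySpark DataFrames

users = [r["user"] for r in edges.select("user").distinct().collect()]

reached = []
for u in users:
    paths = g.bfs(
        fromExpr="id = 'A'",
        toExpr="id = 'E'",
        edgeFilter="user = '{}'".format(u),
    )
    if paths.head(1):  # a non-empty result means this user eventually reached E from A
        reached.append(u)

print(sorted(reached))  # expected: ['user2', 'user3', 'user4']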
