Finding path between 2 vertices which are not directly connected - apache-spark

I have a connected graph Like this
user1|A,C,B
user2|A,E,B,A
user3|C,B,A,B,E
user4|A,C,B,E,B
where user are the property name and the path for that particular user is followed. For example for
user1 the path is A->C->B
user2: A->E->B->A
user3: C->B->A->B->E
user4: A->C->B->E->B
Now, I want to find all users who have reached from A to E. Output should be
user2, user3, user4(since all these users finally reached E from A, no matter how many hops they took). How can I write the motif for this.
This is what I tried.
val vertices=spark.createDataFrame(List(("A","Billing"),("B","Devices"),("C","Payment"),("D","Data"),("E","Help"))).toDF("id","desc")
val edges = spark.createDataFrame(List(("A","C","user1"),
("C","B","user1"),
("A","E","user2"),
("E","B","user2"),
("B","A","user2"),
("C","B","user3"),
("B","A","user3"),
("A","B","user3"),
("B","E","user3"),
("A","C","user4"),
("C","B","user4"),
("B","E","user4"),
("E","B","user4"))).toDF("src","dst","user")
val pathAnalysis=GraphFrame(vertices,edges)
pathAnalysis.find("(a)-[]->();()-[]->();()-[]->(d)").filter("a.id='A'").filter("d.id='E'").distinct().show()
But I am getting an exception like this
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
Join Inner
:- Project [a#355]
: +- Join Inner, (__tmp-4363599943432734077#353.src = a#355.id)
: :- LocalRelation [__tmp-4363599943432734077#353]
: +- Project [named_struct(id, _1#0, desc, _2#1) AS a#355]
: +- Filter (named_struct(id, _1#0, desc, _2#1).id = A)
: +- LocalRelation [_1#0, _2#1]
+- LocalRelation
and
LocalRelation [__tmp-1043886091038848698#371]
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
I am not sure if my condition is correct or how to set this property
spark.sql.crossJoin.enabled=true on a spark-shell
I invoked my spark-shell as follows
spark-shell --packages graphframes:graphframes:0.3.0-spark2.0-s_2.11

my suggested solution is kinda trivial, but it will work fine if the paths are relatively short, and the number of user (i.e., number of rows in the data set) is big. If this is not the case, please let me know, other implementations are possible.
case class UserPath(
userId: String,
path: List[String])
val dsUsers = Seq(
UserPath("user1", List("A", "B", "C")),
UserPath("user2", List("A", "E", "B", "A")))
.doDF.as[UserPath]
def pathExists(up: UserPath): Option[String] = {
val prefix = up.path.takeWhile(s => s != "A")
val len = up.path.length
if (up.path.takeRight(len - prefix.length).contains("E"))
Some(up.userId)
else
None
}
// Users with path from A -> E.
dsUsers.map(pathExists).filter(opt => !opt.isEmpty)

You can also use BFS algorithm for it: http://graphframes.github.io/graphframes/docs/_site/api/scala/index.html#org.graphframes.lib.BFS
With your data model you will have to iterate over users and run the BFS for each of them like this:
scala> pathAnalysis.bfs.fromExpr($"id" === "A").toExpr($"id" === "E").edgeFilter($"user" === "user3").run().show
+------------+-------------+------------+-------------+---------+
| from| e0| v1| e1| to|
+------------+-------------+------------+-------------+---------+
|[A, Billing]|[A, B, user3]|[B, Devices]|[B, E, user3]|[E, Help]|
+------------+-------------+------------+-------------+---------+

Related

Databricks CROSS JOIN failing to identify column

Whenever I apply a CROSS JOIN to my Databricks SQL query I get a message letting me know that a column does not exists, but I'm not sure if the issue is with CROSS JOIN
For example, code should identify characters such as http, https, ://, / and remove those characters and add a column called websiteurl without the characters aforemention. i.e.
The code is a follows:
SELECT tt.homepage_url
,websiteurl = LEFT(v1.RightString,COALESCE(NULLIF(CHARINDEX('/',v1.RightString)-1,-1),150))
FROM basecrmcbreport.organizations tt
CROSS join (VALUES(SUBSTRING(homepage_url,CHARINDEX('//',homepage_url)+2,150)))v1(RightString)
However, the above returns the following:
Error in SQL statement: AnalysisException: Column 'homepage_url' does not exist. Did you mean one of the following? []; line 4 pos 31;
'Project ['tt.homepage_url, unresolvedalias(('websiteurl = 'LEFT('v1.RightString, 'COALESCE('NULLIF(('CHARINDEX(/, 'v1.RightString) - 1), -1), 150))), None)]
+- 'Join Cross
:- SubqueryAlias tt
: +- SubqueryAlias spark_catalog.basecrmcbreport.organizations
: +- Relation basecrmcbreport.organizations[uuid#2439,name#2440,type#2441,permalink#2442,cb_url#2443,rank#2444,created_at#2445,updated_at#2446,legal_name#2447,roles#2448,domain#2449,homepage_url#2450,country_code#2451,state_code#2452,region#2453,city#2454,address#2455,postal_code#2456,status#2457,short_description#2458,category_list#2459,category_groups_list#2460,num_funding_rounds#2461,total_funding_usd#2462,... 22 more fields] parquet
+- 'SubqueryAlias v1
+- 'UnresolvedSubqueryColumnAliases [RightString]
+- 'UnresolvedInlineTable [col1], [['SUBSTRING('homepage_url, ('CHARINDEX(//, 'homepage_url) + 2), 150)]]
Can someone let me know how to fix this?
Joins in general are used on tables but not table valued functions as cross apply or outer apply.
The following is the sample data that I have used for demonstration.
When I used the same query as yours, I got the same error.
%sql
SELECT tt.homepage_url
,websiteurl = LEFT(v1.RightString,COALESCE(NULLIF(CHARINDEX('/',v1.RightString)-1,-1),150))
FROM demo tt
CROSS join (VALUES(SUBSTRING(homepage_url,CHARINDEX('//',homepage_url)+2,150)))v1(RightString)
To resolve this, I have used the select statement to remove the https:// or http:// similar to your logic. Looking at the desired result, I have used inner join (instead of cross join) using like operator for defining condition. The query is as following
%sql
SELECT tt.homepage_url
, LEFT(v1.RightString,COALESCE(NULLIF(CHARINDEX('/',v1.RightString)-1,-1),150)) as website_url
FROM demo tt
inner join (select (SUBSTRING(homepage_url,CHARINDEX('//',homepage_url)+2,150)) from demo)v1(RightString) where tt.homepage_url like concat('%',v1.RightString)
UPDATE for url with \:
SELECT tt.homepage_url
, LEFT(v1.RightString,COALESCE(NULLIF(CHARINDEX('/',replace(v1.RightString,'\\','/'))-1,-1),150)) as website_url
FROM demo tt
inner join (select (SUBSTRING(homepage_url,CHARINDEX('//',homepage_url)+2,150)) from demo)v1(RightString) on tt.homepage_url like concat('%',v1.RightString,'%') escape '|';

pyspark.sql.utils.AnalysisException: Column ambiguous but no duplicate column names

I'm getting an ambiguous column exception when joining on the id column of a dataframe, but there are no duplicate columns in the dataframe. What could be causing this error to be thrown?
Join operation, where a and input have been processed by other functions:
b = (
input
.where(F.col('st').like('%VALUE%'))
.select('id', 'sii')
)
a.join(b, b['id'] == a['item'])
Dataframes:
(Pdb) a.explain()
== Physical Plan ==
*(1) Scan ExistingRDD[item#25280L,sii#24665L]
(Pdb) b.explain()
== Physical Plan ==
*(1) Project [id#23711L, sii#24665L]
+- *(1) Filter (isnotnull(st#25022) AND st#25022 LIKE %VALUE%)
+- *(1) Scan ExistingRDD[id#23711L,st#25022,sii#24665L]
Exception:
pyspark.sql.utils.AnalysisException: Column id#23711L are ambiguous. It's probably because you joined several Datasets together, and some of these Datasets are the same. This column points to one of the Datasets but Spark is unable to figure out which one. Please alias the Datasets with different names via Dataset.as before joining them, and specify the column using qualified name, e.g. df.as("a").join(df.as("b"), $"a.id" > $"b.id"). You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable this check.;
If I recreate the dataframe using the same schema, I do not get any errors:
b_clean = spark_session.createDataFrame([], b.schema)
a.join(b_clean, b_clean['id'] == a['item'])
What can I look at to troubleshoot what happened with the original dataframes that would cause the ambiguous column error?
This error and the fact that your sii column has the same id in both tables (i.e. sii#24665L) tells that both a and b dataframes are made using the same source. So, in essence, this makes your join a self join (exactly what the error message tells). In such cases it's recommended to use alias for dataframes. Try this:
a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
Again, in some systems you may not be able to save your result, as the resulting dataframe will have 2 sii columns. I would recommend to explicitly select only the columns that you need. Renaming columns using alias may also help if you decide that you'll need both the duplicate columns. E.g.:
df = (
a.alias('a').join(b.alias('b'), F.col('b.id') == F.col('a.item'))
.select('item',
'id',
F.col('a.sii').alias('a_sii')
)
)

Error: Resolved attributes missing in join

I'm using pyspark to perform a join of two tables with a relatively complex join condition (using greater than/smaller than in the join conditions). This works fine, but breaks down as soon as I add a fillna command before the join.
The code looks something like this:
join_cond = [
df_a.col1 == df_b.colx,
df_a.col2 == df_b.coly,
df_a.col3 >= df_b.colz
]
df = (
df_a
.fillna('NA', subset=['col1'])
.join(df_b, join_cond, 'left')
)
This results in an error like this:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) col1#4765 missing from col1#6488,col2#4766,col3#4768,colx#4823,coly#4830,colz#4764 in operator !Join LeftOuter, (((col1#4765 = colx#4823) && (col2#4766 = coly#4830)) && (col3#4768 >= colz#4764)). Attribute(s) with the same name appear in the operation: col1. Please check if the right attribute(s) are used.
It looks like spark no longer recognizes col1 after performing the fillna. (The error does not come up if I comment that out.) The problem is that I do need that statement. (And in general I've simplified this example a lot.)
I've looked at this question, but these answers do not work for me. Specifically, using .alias('a') after the fillna doesn't work because then spark does not recognize the a in the join condition.
Could someone:
Explain exactly why this is happening and how I can avoid it in the future?
Advise me on a way to solve it?
Thanks in advance for your help.
What is happening?
In order to "replace" empty values, a new dataframe is created that contains new columns. These new columns have the same names like the old ones but are effectively completely new Spark objects. In the Scala code you can see that the "changed" columns are newly created ones while the original columns are dropped.
A way to see this effect is to call explain on the dataframe before and after replacing the empty values:
df_a.explain()
prints
== Physical Plan ==
*(1) Project [_1#0L AS col1#6L, _2#1L AS col2#7L, _3#2L AS col3#8L]
+- *(1) Scan ExistingRDD[_1#0L,_2#1L,_3#2L]
while
df_a.fillna(42, subset=['col1']).explain()
prints
== Physical Plan ==
*(1) Project [coalesce(_1#0L, 42) AS col1#27L, _2#1L AS col2#7L, _3#2L AS col3#8L]
+- *(1) Scan ExistingRDD[_1#0L,_2#1L,_3#2L]
Both plans contain a column called col1, but in the first case the internal representation is called col1#6L while the second one is called col1#27L.
When the join condition df_a.col1 == df_b.colx now is associated with the column col1#6L the join will fail if only the column col1#27L is part of the left table.
How can the problem be solved?
The obvious way would be to move the `fillna` operation before the definition of the join condition:
df_a = df_a.fillna('NA', subset=['col1'])
join_cond = [
df_a.col1 == df_b.colx,
[...]
If this is not possible or wanted you can change the join condition. Instead of using a column from the dataframe (df_a.col1) you can use a column that is not associated with any dataframe by using the col function. This column works only based on its name and therefore ignores when the column is replaced in the dataframe:
from pyspark.sql import functions as F
join_cond = [
F.col("col1") == df_b.colx,
df_a.col2 == df_b.coly,
df_a.col3 >= df_b.colz
]
The downside of this second approach is that the column names in both tables must be unique.

Modifying query using Spark Catalyst Logical Plan

Is it possible to add/replace existing column expression in
DataFrame API/SQL using extension point.
Ex: assume we inject resolution rule which could check the project
node from the plan and on checking for column "name", replace it
with upper(name) for instance.
Is such a thing possible using Extension Points. The examples which i have
found are mostly simple, which do not manipulate the input expressions in the manner i need.
Kindly let me know if this is possible.
Yes this is possible.
Lets take an example. Suppose we want to write a rule which checks for Project operator and if the project is for some particular column (say 'column2'), then it multiply it by 2.
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans._
import org.apache.spark.sql.Column
import org.apache.spark.sql.types._
object DoubleColumn2OptimizationRule extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transform {
case p: Project =>
if (p.projectList.filter(_.name == "column2").size >= 1) {
val newList = p.projectList.map { case x =>
if (x.name == "column2") {
Alias(Multiply(Literal(2, IntegerType), x), "column2_doubled")()
} else {
x
}
}
p.copy(projectList = newList)
} else {
p
}
}
}
say we have a table "table1" which has two columns - column1, column2.
Without this rule -
> spark.sql("select column2 from table1 limit 10").collect()
Array([1], [2], [3], [4], [5], [6], [7], [8], [9], [10])
with this rule -
> spark.experimental.extraOptimizations = Seq(DoubleColumn2OptimizationRule)
> spark.sql("select column2 from table1 limit 10").collect()
Array([2], [4], [6], [8], [10], [12], [14], [16], [18], [20])
Also you can call explain on DataFrame to check the plan -
> spark.sql("select column2 from table1 limit 10").explain
== Physical Plan ==
CollectLimit 10
+- *(1) LocalLimit 10
+- *(1) Project [(2 * column2#213) AS column2_doubled#214]
+- HiveTableScan [column2#213], HiveTableRelation `default`.`table1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [column1#212, column2#213]

Transforming Spark SQL AST with extraOptimizations

I'm wanting to take a SQL string as a user input, then transform it before execution. In particular, I want to modify the top-level projection (select clause), injecting additional columns to be retrieved by the query.
I was hoping to achieve this by hooking into Catalyst using sparkSession.experimental.extraOptimizations. I know that what I'm attempting isn't strictly speaking an optimisation (the transformation changes the semantics of the SQL statement), but the API still seems suitable. However, my transformation seems to be ignored by the query executor.
Here is a minimal example to illustrate the issue I'm having. First define a row case class:
case class TestRow(a: Int, b: Int, c: Int)
Then define an optimisation rule which simply discards any projection:
object RemoveProjectOptimisationRule extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
case x: Project => x.child
}
}
Now create a dataset, register the optimisation, and run a SQL query:
// Create a dataset and register table.
val dataset = List(TestRow(1, 2, 3)).toDS()
val tableName: String = "testtable"
dataset.createOrReplaceTempView(tableName)
// Register "optimisation".
sparkSession.experimental.extraOptimizations =
Seq(RemoveProjectOptimisationRule)
// Run query.
val projected = sqlContext.sql("SELECT a FROM " + tableName + " WHERE a = 1")
// Print query result and the queryExecution object.
println("Query result:")
projected.collect.foreach(println)
println(projected.queryExecution)
Here is the output:
Query result:
[1]
== Parsed Logical Plan ==
'Project ['a]
+- 'Filter ('a = 1)
+- 'UnresolvedRelation `testtable`
== Analyzed Logical Plan ==
a: int
Project [a#3]
+- Filter (a#3 = 1)
+- SubqueryAlias testtable
+- LocalRelation [a#3, b#4, c#5]
== Optimized Logical Plan ==
Filter (a#3 = 1)
+- LocalRelation [a#3, b#4, c#5]
== Physical Plan ==
*Filter (a#3 = 1)
+- LocalTableScan [a#3, b#4, c#5]
We see that the result is identical to that of the original SQL statement, without the transformation applied. Yet, when printing the logical and physical plans, the projection has indeed been removed. I've also confirmed (through debug log output) that the transformation is indeed being invoked.
Any suggestions as to what's going on here? Maybe the optimiser simply ignores "optimisations" that change semantics?
If using the optimisations isn't the way to go, can anybody suggest an alternative? All I really want to do is parse the input SQL statement, transform it, and pass the transformed AST to Spark for execution. But as far as I can see, the APIs for doing this are private to the Spark sql package. It may be possible to use reflection, but I'd like to avoid that.
Any pointers would be much appreciated.
As you guessed, this is failing to work because we make assumptions that the optimizer will not change the results of the query.
Specifically, we cache the schema that comes out of the analyzer (and assume the optimizer does not change it). When translating rows to the external format, we use this schema and thus are truncating the columns in the result. If you did more than truncate (i.e. changed datatypes) this might even crash.
As you can see in this notebook, it is in fact producing the result you would expect under the covers. We are planning to open up more hooks at some point in the near future that would let you modify the plan at other phases of query execution. See SPARK-18127 for more details.
Michael Armbrust's answer confirmed that this kind of transformation shouldn't be done via optimisations.
I've instead used internal APIs in Spark to achieve the transformation I wanted for now. It requires methods that are package-private in Spark. So we can access them without reflection by putting the relevant logic in the appropriate package. In outline:
// Must be in the spark.sql package.
package org.apache.spark.sql
object SQLTransformer {
def apply(sparkSession: SparkSession, ...) = {
// Get the AST.
val ast = sparkSession.sessionState.sqlParser.parsePlan(sql)
// Transform the AST.
val transformedAST = ast match {
case node: Project => // Modify any top-level projection
...
}
// Create a dataset directly from the AST.
Dataset.ofRows(sparkSession, transformedAST)
}
}
Note that this of course may break with future versions of Spark.

Resources