Custom Cypher Query Restrictions - Spark Neo4j Connector - apache-spark

Are there any limitations on custom Cypher queries when running them through the Spark connector? I am using a query with CALL subqueries and APOC library procedures. The same query works fine in the Cypher shell, but fails with the following error when run through the Spark connector:
pyspark.sql.utils.IllegalArgumentException: Please provide a valid WRITE query
Is there any documentation on what is allowed in custom Cypher when using the Spark connector?
My query looks something like this -
WITH event.dVal as dVal, COLLECT(cVal) as cVals,
     apoc.coll.toSet(apoc.coll.flatten(COLLECT(cVal.sdVals))) as sdVals
CALL {
    WITH dVal, cVals, sdVals
    MERGE (d:DVal {DVal: dVal})
    WITH d, cVals, sdVals
    CALL {
        WITH d, sdVals
        UNWIND sdVals as sdVal
        WITH d, sdVal
        MERGE (sd:DVal {DVal: sdVal})
        WITH d, sd
        OPTIONAL MATCH (d)-[rel3:Relation1]->(sd)
        FOREACH (o IN CASE WHEN rel3 IS NULL THEN [1] ELSE [] END |
            CREATE (d)-[:Relation1]->(sd)
        )
        RETURN sd
    }
    WITH d, cVals, COLLECT(id(sd)) as sds
    CALL {
        WITH d, cVals
        UNWIND cVals as cVal
        WITH d, cVal
        MERGE (c:CVal {CId: cVal.id})
        WITH d, c, cVal
        OPTIONAL MATCH (c)-[rel1:Relation2]->(d)
        FOREACH (o IN CASE WHEN rel1 IS NULL THEN [1] ELSE [] END |
            CREATE (c)-[:Relation2]->(d)
        )
    }
}
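For context, here is a minimal sketch (Scala, not from the original post) of how a custom write query is typically passed to the Neo4j Spark Connector 4.x. The URL, credentials, save mode, and the simplified MERGE query are placeholder assumptions; the connector exposes each DataFrame row to the query as the event variable used above.

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("neo4j-custom-cypher").getOrCreate()
import spark.implicits._

// Hypothetical input rows; only dVal is used by the simplified query below.
val df = Seq("d1", "d2").toDF("dVal")

df.write
  .format("org.neo4j.spark.DataSource")
  .mode(SaveMode.Append)                          // assumed save mode for query-based writes
  .option("url", "bolt://localhost:7687")         // placeholder connection details
  .option("authentication.basic.username", "neo4j")
  .option("authentication.basic.password", "password")
  // Each incoming row is bound to `event` inside the query.
  .option("query", "MERGE (d:DVal {DVal: event.dVal})")
  .save()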

Related

ArangoDB AQL optimisation for a cluster configuration

With a data structure like
Departure -> Trip -> Driver
an ArangoDB Spring Data derived query in the Trip repository such as findByDriverIdNumberAndDepartureStartTimeBetween(String idNumber, String startTime, String endTime) results in an AQL query like
WITH driver, departure
FOR e IN trip
FILTER
(FOR e1 IN 1..1 OUTBOUND e._id tripToDriver FILTER e1.idNumber == '999999-9999' RETURN 1)[0] == 1
AND
(FOR e1 IN 1..1 INBOUND e._id departureToTrip FILTER e1.startTime >= '2019-08-14T00:00:00' AND e1.startTime <= '2019-08-14T23:59:59' RETURN 1)[0] == 1
RETURN e
which performs fine (~1 s) with a single-instance setup. But after setting up a cluster with the Kubernetes ArangoDB Operator using default settings (3 nodes and coordinators), the query time increased tenfold, which is probably due to sharding and the multi-machine communication needed to fulfil the query.
This attempt to optimise the query gave better results, with a query time of around 3 to 4 seconds:
WITH driver, departure
FOR doc IN trip
LET drivers = (FOR v IN 1..1 OUTBOUND doc tripToDriver RETURN v)
FILTER drivers[0].idNumber == '999999-9999'
LET departures = (FOR v in 1..1 INBOUND doc departureToTrip RETURN v)
FILTER departures[0].startTime >= '2019-08-14T00:00:00' AND departures[0].startTime <= '2019-08-14T23:59:59'
RETURN doc
But can I optimise the query further for the cluster setup, to come closer to the single instance query time of one second?

U-SQL Error in Naming the Column

I have a JSON document where the order of fields is not fixed,
i.e. I can have [A, B, C] or [B, C, A].
A, B and C are all JSON objects of the form {Name: x, Value: y}.
So when I use U-SQL to extract the JSON (I don't know their order) and put it into a CSV (for which I will need column names):
#output =
SELECT
A["Value"] ?? "0" AS CAST ### (("System_" + A["Name"]) AS STRING),
B["Value"] ?? "0" AS "System_" + B["Name"],
System_da
So I am trying to use the "Name" field from the JSON as the column name,
but I am getting the error at ### above:
Message
syntax error. Expected one of: FROM ',' EXCEPT GROUP HAVING INTERSECT OPTION ORDER OUTER UNION UNION WHERE ';' ')'
Resolution
Correct the script syntax, using expected token(s) as a guide.
Description
Invalid syntax found in the script.
Details
at token '(', line 74
near the ###:
**************
It seems I am not allowed to set the column name dynamically, yet this is an absolute necessity for my case.
Input: [A, B, C], [C, B, A]
Output: A.name B.name C.name
Row 1's values
Row 2's values
This
#output =
SELECT
A["Value"] ?? "0" AS CAST ### (("System_" + A["Name"]) AS STRING),
B["Value"] ?? "0" AS "System_" + B["Name"],
System_da
is not a valid SELECT clause (neither in U-SQL nor any other SQL dialect I am aware of).
What is the JSON Array? Is it a key/value pair? Or positional? Or a single value in the array that you want to have a marker for whether it is present in the array?
From your example, it seems that you want something like:
Input:
[["A","B","C"],["C","D","B"]]
Output:
A B C D
true true true false
false true true true
If that is the case, I would write it as:
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
#input =
SELECT "[[\"A\", \"B\", \"C\"],[\"C\", \"D\", \"B\"]]" AS json
FROM (VALUES (1)) AS T(x);
#data =
SELECT JsonFunctions.JsonTuple(arrstring) AS a
FROM #input CROSS APPLY EXPLODE( JsonFunctions.JsonTuple(json).Values) AS T(arrstring);
#data =
SELECT a.Contains("A") AS A, a.Contains("B") AS B, a.Contains("C") AS C, a.Contains("D") AS D
FROM (SELECT a.Values AS a FROM #data) AS t;
OUTPUT #data
TO "/output/data.csv"
USING Outputters.Csv(outputHeader : true);
If you need something more dynamic, either use the resulting SqlArray or SqlMap or use the above approach to generate the script.
However, I wonder why you would model your information this way in the first place. I would recommend finding a more appropriate way to mark the presence of the value in the JSON.
UPDATE: I missed your comment that the inner array members are objects with two key-value pairs, where one is always called Name (the property name) and the other is always called Value (the property value). So here is the answer for that case.
First: modelling key-value pairs in JSON using {"Name": "propname", "Value": "value"} is a complete misuse of the flexible modelling capabilities of JSON and should not be done. Use {"propname": "value"} instead if you can.
So, changing the input accordingly, the following will give you the pivoted values. Note that you will need to know the property names ahead of time, and there are several options for how to do the pivot. I do it in the statement where I create the new SqlMap instance (to reduce the over-modelling), and then in the next SELECT, where I get the values from the map.
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
USING Microsoft.Analytics.Samples.Formats.Json;
#input =
SELECT "[[{\"Name\":\"A\", \"Value\": 1}, {\"Name\": \"B\", \"Value\": 2}, {\"Name\": \"C\", \"Value\":3 }], [{\"Name\":\"C\", \"Value\": 4}, {\"Name\":\"D\", \"Value\": 5}, {\"Name\":\"B\", \"Value\": 6}]]" AS json
FROM (VALUES (1)) AS T(x);
#data =
SELECT JsonFunctions.JsonTuple(arrstring) AS a
FROM #input CROSS APPLY EXPLODE( JsonFunctions.JsonTuple(json)) AS T(rowid, arrstring);
#data =
SELECT new SqlMap<string, string>(
a.Values.Select((kvp) =>
new KeyValuePair<string, string>(
JsonFunctions.JsonTuple(kvp)["Name"]
, JsonFunctions.JsonTuple(kvp)["Value"])
)) AS kvp
FROM #data;
#data =
SELECT kvp["A"] AS A,
kvp["B"] AS B,
kvp["C"] AS C,
kvp["D"] AS D
FROM #data;
OUTPUT #data
TO "/output/data.csv"
USING Outputters.Csv(outputHeader : true);

Transforming Spark SQL AST with extraOptimizations

I want to take a SQL string as user input and then transform it before execution. In particular, I want to modify the top-level projection (the SELECT clause), injecting additional columns to be retrieved by the query.
I was hoping to achieve this by hooking into Catalyst using sparkSession.experimental.extraOptimizations. I know that what I'm attempting isn't strictly speaking an optimisation (the transformation changes the semantics of the SQL statement), but the API still seems suitable. However, my transformation seems to be ignored by the query executor.
Here is a minimal example to illustrate the issue I'm having. First define a row case class:
case class TestRow(a: Int, b: Int, c: Int)
Then define an optimisation rule which simply discards any projection:
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

object RemoveProjectOptimisationRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
    case x: Project => x.child
  }
}
Now create a dataset, register the optimisation, and run a SQL query:
// Create a dataset and register table.
val dataset = List(TestRow(1, 2, 3)).toDS()
val tableName: String = "testtable"
dataset.createOrReplaceTempView(tableName)
// Register "optimisation".
sparkSession.experimental.extraOptimizations =
Seq(RemoveProjectOptimisationRule)
// Run query.
val projected = sqlContext.sql("SELECT a FROM " + tableName + " WHERE a = 1")
// Print query result and the queryExecution object.
println("Query result:")
projected.collect.foreach(println)
println(projected.queryExecution)
Here is the output:
Query result:
[1]
== Parsed Logical Plan ==
'Project ['a]
+- 'Filter ('a = 1)
+- 'UnresolvedRelation `testtable`
== Analyzed Logical Plan ==
a: int
Project [a#3]
+- Filter (a#3 = 1)
+- SubqueryAlias testtable
+- LocalRelation [a#3, b#4, c#5]
== Optimized Logical Plan ==
Filter (a#3 = 1)
+- LocalRelation [a#3, b#4, c#5]
== Physical Plan ==
*Filter (a#3 = 1)
+- LocalTableScan [a#3, b#4, c#5]
We see that the result is identical to that of the original SQL statement, without the transformation applied. Yet, when printing the logical and physical plans, the projection has indeed been removed. I've also confirmed (through debug log output) that the transformation is indeed being invoked.
Any suggestions as to what's going on here? Maybe the optimiser simply ignores "optimisations" that change semantics?
If using the optimisations isn't the way to go, can anybody suggest an alternative? All I really want to do is parse the input SQL statement, transform it, and pass the transformed AST to Spark for execution. But as far as I can see, the APIs for doing this are private to the Spark sql package. It may be possible to use reflection, but I'd like to avoid that.
Any pointers would be much appreciated.
As you guessed, this fails to work because we assume that the optimizer will not change the results of the query.
Specifically, we cache the schema that comes out of the analyzer (and assume the optimizer does not change it). When translating rows to the external format, we use this schema and thus are truncating the columns in the result. If you did more than truncate (i.e. changed datatypes) this might even crash.
As you can see in this notebook, it is in fact producing the result you would expect under the covers. We are planning to open up more hooks at some point in the near future that would let you modify the plan at other phases of query execution. See SPARK-18127 for more details.
Michael Armbrust's answer confirmed that this kind of transformation shouldn't be done via optimisations.
I've instead used internal APIs in Spark to achieve the transformation I wanted for now. It requires methods that are package-private in Spark, so we can access them without reflection by putting the relevant logic in the appropriate package. In outline:
// Must be in the spark.sql package.
package org.apache.spark.sql

import org.apache.spark.sql.catalyst.plans.logical.Project

object SQLTransformer {
  def apply(sparkSession: SparkSession, ...) = {
    // Get the AST.
    val ast = sparkSession.sessionState.sqlParser.parsePlan(sql)
    // Transform the AST.
    val transformedAST = ast match {
      case node: Project => // Modify any top-level projection
        ...
    }
    // Create a Dataset directly from the AST.
    Dataset.ofRows(sparkSession, transformedAST)
  }
}
Note that this of course may break with future versions of Spark.
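For reference (not part of the original answers), the hooks tracked by SPARK-18127 were later exposed as SparkSessionExtensions in Spark 2.2+. Below is a hedged sketch of registering the rule above through that public API instead of internal ones; a rule injected at analysis time, unlike an optimizer rule, is reflected in the plan's schema:

import org.apache.spark.sql.SparkSession

// Sketch only: registers RemoveProjectOptimisationRule (defined earlier)
// as an analyzer (resolution) rule via the public extensions API.
val sparkWithExtensions = SparkSession.builder()
  .appName("extensions-example")
  .withExtensions { extensions =>
    extensions.injectResolutionRule(_ => RemoveProjectOptimisationRule)
  }
  .getOrCreate()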

What does "Correlated scalar subqueries must be Aggregated" mean?

I use Spark 2.0.
I'd like to execute the following SQL query:
val sqlText = """
select
    f.ID as TID,
    f.BldgID as TBldgID,
    f.LeaseID as TLeaseID,
    f.Period as TPeriod,
    coalesce(
        (select f.ChargeAmt
         from Fact_CMCharges f
         where f.BldgID = Fact_CMCharges.BldgID
         limit 1),
        0) as TChargeAmt1,
    f.ChargeAmt as TChargeAmt2,
    l.EFFDATE as TBreakDate
from
    Fact_CMCharges f
join
    CMRECC l on l.BLDGID = f.BldgID and l.LEASID = f.LeaseID and l.INCCAT = f.IncomeCat
        and date_format(l.EFFDATE,'D') <> 1 and f.Period = EFFDateInt(l.EFFDATE)
where
    f.ActualProjected = 'Lease'
except(
    select * from TT1 t2 left semi join Fact_CMCharges f2 on t2.TID = f2.ID)
"""
val query = spark.sql(sqlText)
query.show()
It seems that the inner statement in coalesce gives the following error:
pyspark.sql.utils.AnalysisException: u'Correlated scalar subqueries must be Aggregated: GlobalLimit 1\n+- LocalLimit 1\n
What's wrong with the query?
You have to make sure that your sub-query, by definition (and not by data), only returns a single row; otherwise the Spark analyzer complains while analyzing the SQL statement.
So when Catalyst can't be 100% sure, just by looking at the SQL statement (without looking at your data), that the sub-query only returns a single row, this exception is thrown.
If you are sure that your subquery only gives a single row, you can use one of the following standard aggregation functions so the Spark analyzer is happy (a short example follows the list):
first
avg
max
min
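For illustration (not from the original answer), the correlated subquery from the question could be rewritten with first so the analyzer can prove a single row is returned. The table and column names are taken from the question; everything else is a simplified sketch:

val fixedSql = """
select
    f.ID as TID,
    coalesce(
        (select first(i.ChargeAmt)
         from Fact_CMCharges i
         where i.BldgID = f.BldgID),
        0) as TChargeAmt1
from Fact_CMCharges f
"""
spark.sql(fixedSql).show()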

Getting metadata in plain SQL statement in Slick 3.1.x

In the following plain SQL statement in Slick I know beforehand that it will return a list of (String, String)
sql"""select c.name, s.name
from coffees c, suppliers s
where c.price < $price and s.id = c.sup_id""".as[(String, String)]
But what if I don't know the column types? Can I analyze the metadata and retrieve the values? In JDBC I could use getInt(n) and getString(n), is there anything similar in Slick?
You can use tsql (Type-Checked SQL Statements):
tsql"""select c.name, s.name
from coffees c, suppliers s
where c.price < $price and s.id = c.sup_id"""
this will return a DBIO[Seq[(String, String)]] (depending on the column types); it produces a DBIOAction of the correct type without requiring a call to .as.
Note: I've found it a little flaky (to the point of being unusable) with option types, so beware if your columns can be null (since null: String).
This requires a little bit of wiring up: you need @StaticDatabaseConfig (e.g. on your DAO), as these types are checked against the database at compile time:
// annotate the object
@StaticDatabaseConfig("file:src/main/resources/application.conf#tsql")
...
val dc = DatabaseConfig.forAnnotation[JdbcProfile]
import dc.driver.api._
val db = dc.db

// to pull out a Future[Seq[(String, String)]]
// use db.run(tsql"...")
// to pull out a Future[Option[(String, String)]]
// use db.run(tsql"...".headOption)
// etc.
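If the column types are genuinely unknown ahead of time, the closest analogue to JDBC's getInt(n)/getString(n) is Slick's GetResult/PositionedResult mechanism for plain SQL queries. A minimal sketch (not from the original answer, assuming a coffees table with a string name and a numeric price column):

import slick.jdbc.GetResult

// Positional access, analogous to JDBC's getString(n)/getDouble(n):
// r.nextString()/r.nextDouble() walk the result columns in order (r.<< infers the type).
implicit val getNamePrice: GetResult[(String, Double)] =
  GetResult(r => (r.nextString(), r.nextDouble()))

val action = sql"""select c.name, c.price
                   from coffees c
                   where c.price < $price""".as[(String, Double)]

// db.run(action) yields a Future[Seq[(String, Double)]]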
