How to extend Spark Catalyst optimizer with custom rules? - apache-spark

I want to use Catalyst rules to transform star-schema (https://en.wikipedia.org/wiki/Star_schema) SQL query to SQL query to denormalized star-schema where some fields from dimensions tables are represented in facts table.
I tried to find some extension points to add own rules to make a transformation described above. But I didn't find any extension points. So there are the following questions:
How can I add own rules to catalyst optimizer?
Is there another solution to implement a functionality described above?

Following #Ambling advice you can use the sparkSession.experimental.extraStrategies to add your functionality to the SparkPlanner.
An example strategy that simply prints "Hello world" on the console
object MyStrategy extends Strategy {
def apply(plan: LogicalPlan): Seq[SparkPlan] = {
println("Hello world!")
Nil
}
}
with example run:
val spark = SparkSession.builder().master("local").getOrCreate()
spark.experimental.extraStrategies = Seq(MyStrategy)
val q = spark.catalog.listTables.filter(t => t.name == "five")
q.explain(true)
spark.stop()
You can find a working example project on friend's GitHub: https://github.com/bartekkalinka/spark-custom-rule-executor

As a clue, now in Spark 2.0, you can import extraStrategies and extraOptimizations through SparkSession.experimental.

Related

Multiple keys index leads in an PSQLException with Slick, Postgres on a lagom project

I have a lagom application and using for the readside a postgres with lagom jdbc.
Tables are created and works fine. After a restart and an already created table with a multiple key index I get always an error:
org.postgresql.util.PSQLException: ERROR: relation "article_number_fulfiller_idx" already exists
My table looks like this:
class ArticleTable(tag: Tag) extends Table[ArticleTableData](tag,ArticleTable.TableName) {
def entityId = column[UUID](ArticleTable.ColEntityId,O.PrimaryKey)
def articleBaseNumber = column[String](ArticleTable.ColArticleBaseNumber)
def articleSpecificationNumber = column[Option[String]](ArticleTable.ColArticleSpecificationNumber)
def fulfillerVendorNumber = column[String](ArticleTable.ColFulfillerVendorNumber)
def fulfillerName = column[String](ArticleTable.ColFulfillerName)
def availability = column[String](ArticleTable.ColAvailability)
def completeArticleNumber = column[String]("complete_article_number")
def idxKey = index("article_number_fulfiller_idx",(completeArticleNumber,fulfillerVendorNumber),unique = true)
def * = (entityId,articleBaseNumber,articleSpecificationNumber,fulfillerVendorNumber,fulfillerName,availability,completeArticleNumber) <> ( (ArticleTableData.apply _).tupled, ArticleTableData.unapply )
}
And my build handler is here:
override def buildHandler(): ReadSideProcessor.ReadSideHandler[Article.Event] = readSide
.builder[Article.Event](ArticleTable.TableName+"_offset")
.setGlobalPrepare(table.schema.createIfNotExists)
.setEventHandler[ArticleCreated](insert)
.setEventHandler[DescriptionAdded](_ => DBIOAction.successful(Done) )
.setEventHandler[DescriptionRemoved](_ => DBIOAction.successful(Done) )
.build()
I updated my sbt to use the latest:
Instead of this
lagomScaladslPersistenceJdbc
I use now this
"com.lightbend.lagom" %% "lagom-scaladsl-persistence-jdbc" % "1.6.2",
"com.typesafe.slick" %% "slick" % "3.3.2"
The exception is only ONE of the exceptions I got. I have for every multiple key index an exception :(
Lagom every restart will try to create new tables and indexes. To avoid getting these errors, do not force the creation of indexes and tables, use create if not exist. If slick does not allow to do it, look on native queries.
You are right about the slick issue.
The page describes that it is useful for dev and test environments.
In this case, you can try to have:
"create if not exist" query as #vladislav-kievski suggested,
some of the databases do not support such queries (SQL Server) and you can catch an exception with appropriate error code
I do not recommend to use globalPrepare() for creating tables and indexes. The main difficulty with this approach is table changes. In this case, you need to think about versioning your database scripts (what will you do in case index removing/adding or removing columns?).

Spark function aliases - performant udfs

Context
In many of the sql queries I write, I find myself combining spark predefined functions in the exact same way, which often results in verbose and duplicated code, and my developer instinct is to want to refactor it.
So, my question is this : is there some way to define some kind of alias for function combinations without resorting to udfs (which are to avoid for perofmance reasons) - the goal being to make the code clearer and cleaner. Essentially, what I want is something like udfs but without the performance penalty. Also, these function MUST be callable from within a spark-sql query usable in spark.sql calls.
Example
For example, let's say my business logic is to reverse some string and hash it like this : (please note that the function combination here is irrelevant, what is important is that it is some combination of existing pre-defined spark functions - possibly many of them)
SELECT
sha1(reverse(person.name)),
sha1(reverse(person.some_information)),
sha1(reverse(person.some_other_information))
...
FROM person
Is there a way of declaring a business function without paying the performance price of using a udf, allowing the code just above to be rewritten as :
SELECT
business(person.name),
business(person.some_information),
business(person.some_other_information)
...
FROM person
I have searched around quite a bit on the spark documentation and on this website and have not found a way of achieving this, which is pretty weird to me because it looks like a pretty natural need, and I don't understand why you should necessarly pay the black-box price of defining and calling a udf.
Is there a way of declaring a business function without paying the performance price of using a udf
You don't have to use udf, you might extend Expression class, or for the simplest operations - UnaryExpression. Then you will have to implement just several methods and here we go. It is natively integrated into Spark, besides that letting use some advantage features such as code generation.
In your case adding business function is pretty straightforward:
def business(column: Column): Column = {
sha1(reverse(column))
}
MUST be callable from within a spark-sql query usable in spark.sql calls
This is more tricky but achievable.
You need to create custom functions registrar:
import org.apache.spark.sql.catalyst.FunctionIdentifier
import org.apache.spark.sql.catalyst.expressions.Expression
object FunctionAliasRegistrar {
val funcs: mutable.Map[String, Seq[Column] => Column] = mutable.Map.empty
def add(name: String, builder: Seq[Column] => Column): this.type = {
funcs += name -> builder
this
}
def registerAll(spark: SparkSession) = {
funcs.foreach { case (alias, builder) => {
def b(children: Seq[Expression]) = builder.apply(children.map(expr => new Column(expr))).expr
spark.sessionState.functionRegistry.registerFunction(FunctionIdentifier(alias), b)
}}
}
}
Then you can use it as follows:
FunctionAliasRegistrar
.add("business1", child => lower(reverse(child.head)))
.add("business2", child => upper(reverse(child.head)))
.registerAll(spark)
dataset.createTempView("data")
spark.sql(
"""
| SELECT business1(name), business2(name) FROM data
|""".stripMargin)
.show(false)
Output:
+--------------------+--------------------+
|lower(reverse(name))|upper(reverse(name))|
+--------------------+--------------------+
|sined |SINED |
|taram |TARAM |
|1taram |1TARAM |
|2taram |2TARAM |
+--------------------+--------------------+
Hope this helps.

How can you update values in a dataset?

So as far as I know Apache Spark doesn't has a functionality that imitates the update SQL command. Like, I can change a single value in a column given a certain condition. The only way around that is to use the following command I was instructed to use (here in Stackoverflow): withColumn(columnName, where('condition', value));
However, the condition should be of column type, meaning I have to use the built in column filtering functions apache has (equalTo, isin, lt, gt, etc). Is there a way I can instead use an SQL statement instead of those built in functions?
The problem is I'm given a text file with SQL statements, like WHERE ID > 5 or WHERE AGE != 50, etc. Then I have to label values based on those conditions, and I thought of following the withColumn() approach but I can't plug-in an SQL statement in that function. Any idea of how I can go around this?
I found a way to go around this:
You want to split your dataset into two sets: the values you want to update and the values you don't want to update
Dataset<Row> valuesToUpdate = dataset.filter('conditionToFilterValues');
Dataset<Row> valuesNotToUpdate = dataset.except(valuesToUpdate);
valueToUpdate = valueToUpdate.withColumn('updatedColumn', lit('updateValue'));
Dataset<Row> updatedDataset = valuesNotToUpdate.union(valueToUpdate);
This, however, doesn't keep the same order of records as the original dataset, so if order is of importance to you, this won't suffice your needs.
In PySpark you have to use .subtract instead of .except
If you are using DataFrame, you can register that dataframe as temp table,
using df.registerTempTable("events")
Then you can query like,
sqlContext.sql("SELECT * FROM events "+)
when clause translates into case clause which you can relate to SQL case clause.
Example
scala> val condition_1 = when(col("col_1").isNull,"NA").otherwise("AVAILABLE")
condition_1: org.apache.spark.sql.Column = CASE WHEN (col_1 IS NULL) THEN NA ELSE AVAILABLE END
or you can chain when clause as well
scala> val condition_2 = when(col("col_1") === col("col_2"),"EQUAL").when(col("col_1") > col("col_2"),"GREATER").
| otherwise("LESS")
condition_2: org.apache.spark.sql.Column = CASE WHEN (col_1 = col_2) THEN EQUAL WHEN (col_1 > col_2) THEN GREATER ELSE LESS END
scala> val new_df = df.withColumn("condition_1",condition_1).withColumn("condition_2",condition_2)
Still if you want to use table, then you can register your dataframe / dataset as temperory table and perform sql queries
df.createOrReplaceTempView("tempTable")//spark 2.1 +
df.registerTempTable("tempTable")//spark 1.6
Now, you can perform sql queries
spark.sql("your queries goes here with case clause and where condition!!!")//spark 2.1
sqlContest.sql("your queries goes here with case clause and where condition!!!")//spark 1.6
If you are using java dataset
you can update dataset by below.
here is the code
Dataset ratesFinal1 = ratesFinal.filter(" on_behalf_of_comp_id != 'COMM_DERIVS' ");
ratesFinal1 = ratesFinal1.filter(" status != 'Hit/Lift' ");
Dataset ratesFinalSwap = ratesFinal1.filter (" on_behalf_of_comp_id in ('SAPPHIRE','BOND') and cash_derivative != 'cash'");
ratesFinalSwap = ratesFinalSwap.withColumn("ins_type_str",functions.lit("SWAP"));
adding new column with value from existing column
ratesFinalSTW = ratesFinalSTW.withColumn("action", ratesFinalSTW.col("status"));

Apache Spark DataSet API : head(n:Int) vs take(n:Int)

Apache Spark Dataset API has two methods i.e, head(n:Int) and take(n:Int).
Dataset.Scala source contains
def take(n: Int): Array[T] = head(n)
Couldn't find any difference in execution code between these two functions. why do API has two different methods to yield the same result?
The reason is because, in my view, Apache Spark Dataset API is trying to mimic Pandas DataFrame API which contains head https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html.
I have experimented & found that head(n) and take(n) gives exactly same replica output. Both produces output in the form of ROW object only.
DF.head(2)
[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]
DF.take(2)
[Row(Transaction_date=u'1/2/2009 6:17', Product=u'Product1', Price=u'1200', Payment_Type=u'Mastercard', Name=u'carolina', City=u'Basildon', State=u'England', Country=u'United Kingdom'), Row(Transaction_date=u'1/2/2009 4:53', Product=u'Product2', Price=u'1200', Payment_Type=u'Visa', Name=u'Betina', City=u'Parkville', State=u'MO', Country=u'United States')]
package org.apache.spark.sql
/* ... */
def take(n: Int): Array[T] = head(n)
I think this is because spark developers tends to give it a rich API, there also the two methods where and filter which does exactly the same thing.

Transforming Spark SQL AST with extraOptimizations

I'm wanting to take a SQL string as a user input, then transform it before execution. In particular, I want to modify the top-level projection (select clause), injecting additional columns to be retrieved by the query.
I was hoping to achieve this by hooking into Catalyst using sparkSession.experimental.extraOptimizations. I know that what I'm attempting isn't strictly speaking an optimisation (the transformation changes the semantics of the SQL statement), but the API still seems suitable. However, my transformation seems to be ignored by the query executor.
Here is a minimal example to illustrate the issue I'm having. First define a row case class:
case class TestRow(a: Int, b: Int, c: Int)
Then define an optimisation rule which simply discards any projection:
object RemoveProjectOptimisationRule extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
case x: Project => x.child
}
}
Now create a dataset, register the optimisation, and run a SQL query:
// Create a dataset and register table.
val dataset = List(TestRow(1, 2, 3)).toDS()
val tableName: String = "testtable"
dataset.createOrReplaceTempView(tableName)
// Register "optimisation".
sparkSession.experimental.extraOptimizations =
Seq(RemoveProjectOptimisationRule)
// Run query.
val projected = sqlContext.sql("SELECT a FROM " + tableName + " WHERE a = 1")
// Print query result and the queryExecution object.
println("Query result:")
projected.collect.foreach(println)
println(projected.queryExecution)
Here is the output:
Query result:
[1]
== Parsed Logical Plan ==
'Project ['a]
+- 'Filter ('a = 1)
+- 'UnresolvedRelation `testtable`
== Analyzed Logical Plan ==
a: int
Project [a#3]
+- Filter (a#3 = 1)
+- SubqueryAlias testtable
+- LocalRelation [a#3, b#4, c#5]
== Optimized Logical Plan ==
Filter (a#3 = 1)
+- LocalRelation [a#3, b#4, c#5]
== Physical Plan ==
*Filter (a#3 = 1)
+- LocalTableScan [a#3, b#4, c#5]
We see that the result is identical to that of the original SQL statement, without the transformation applied. Yet, when printing the logical and physical plans, the projection has indeed been removed. I've also confirmed (through debug log output) that the transformation is indeed being invoked.
Any suggestions as to what's going on here? Maybe the optimiser simply ignores "optimisations" that change semantics?
If using the optimisations isn't the way to go, can anybody suggest an alternative? All I really want to do is parse the input SQL statement, transform it, and pass the transformed AST to Spark for execution. But as far as I can see, the APIs for doing this are private to the Spark sql package. It may be possible to use reflection, but I'd like to avoid that.
Any pointers would be much appreciated.
As you guessed, this is failing to work because we make assumptions that the optimizer will not change the results of the query.
Specifically, we cache the schema that comes out of the analyzer (and assume the optimizer does not change it). When translating rows to the external format, we use this schema and thus are truncating the columns in the result. If you did more than truncate (i.e. changed datatypes) this might even crash.
As you can see in this notebook, it is in fact producing the result you would expect under the covers. We are planning to open up more hooks at some point in the near future that would let you modify the plan at other phases of query execution. See SPARK-18127 for more details.
Michael Armbrust's answer confirmed that this kind of transformation shouldn't be done via optimisations.
I've instead used internal APIs in Spark to achieve the transformation I wanted for now. It requires methods that are package-private in Spark. So we can access them without reflection by putting the relevant logic in the appropriate package. In outline:
// Must be in the spark.sql package.
package org.apache.spark.sql
object SQLTransformer {
def apply(sparkSession: SparkSession, ...) = {
// Get the AST.
val ast = sparkSession.sessionState.sqlParser.parsePlan(sql)
// Transform the AST.
val transformedAST = ast match {
case node: Project => // Modify any top-level projection
...
}
// Create a dataset directly from the AST.
Dataset.ofRows(sparkSession, transformedAST)
}
}
Note that this of course may break with future versions of Spark.

Resources