SparkSQL equivalent for SQL compiled statement with variables - apache-spark

I need to execute SparkSQL statements in an efficient manner, e.g. compile once, execute many times (with different parameter values).
For a simple SQL example:
select * from my_table where year=:1
where :1 is a bind variable, so the statement is compiled only once and executed N times (with different values), I need the SparkSQL equivalent.
Things like:
year = 2020
df_result = spark.sql("select * from my_table where year={0}".format(year))
are not what I expect, since these are not really bind variables, but just one specific instantiated statement.

Depending on where your data are stored, your cluster resources, the size of the table, etc., you might consider caching the entire table; that will at least prevent Spark from having to read off disk/blob storage on every execution of the query:
catalog = spark.catalog
df_my_table = spark.table("my_table")
# Cache the table on first use so repeated queries read from memory
if not catalog.isCached("my_table"):
    df_my_table.cache()
df_result = df_my_table.filter("year = '" + str(year) + "'")
There may be many ways to do this better depending on your architecture, but I'm sticking to a 100% Spark-based solution here.
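For the "compile once, execute many times" part, here is a minimal Scala sketch of the same pattern (the table name my_table, a string year column and a spark-shell style spark session are assumptions): the DataFrame is prepared and cached once, and each call only applies a new filter value. Spark still analyses and plans each query, but the scan is served from the cache rather than re-read from storage.
import org.apache.spark.sql.functions.col

// Prepare and cache the table once; the first action materialises the cache
val dfMyTable = spark.table("my_table").cache()
dfMyTable.count()

// Stand-in for a bind variable: re-apply the filter with a new value each time
def queryByYear(year: String) = dfMyTable.filter(col("year") === year)

queryByYear("2020").show()
queryByYear("2021").show()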

Related

Incremental and parallelism read from RDBMS in Spark using JDBC

I'm working on a project that involves reading data from an RDBMS using JDBC, and I have succeeded in reading the data. This is something I will be doing fairly regularly, on a weekly basis. So I've been trying to come up with a way to ensure that after the initial read, subsequent ones only pull updated records instead of pulling all the data from the table again.
I can do this with Sqoop incremental import by specifying the three parameters (--check-column, --incremental last-modified/append and --last-value). However, I don't want to use Sqoop for this. Is there a way I can replicate the same in Spark with Scala?
Secondly, some of the tables do not have a unique column that can be used as the partitionColumn, so I thought of using a row_number function to add a unique column to these tables and then take the MIN and MAX of that column as the lowerBound and upperBound respectively. My challenge now is how to dynamically pass these values into the read statement like the one below:
val queryNum = "select a1.*, row_number() over (order by sales) as row_nums from (select * from schema.table) a1"
val df = spark.read.format("jdbc").
option("driver", driver).
option("url",url ).
option("partitionColumn",row_nums).
option("lowerBound", min(row_nums)).
option("upperBound", max(row_nums)).
option("numPartitions", some value).
option("fetchsize",some value).
option("dbtable", queryNum).
option("user", user).
option("password",password).
load()
I know the above code is not right and might be missing a whole lot of processes but I guess it'll give a general overview of what I'm trying to achieve here.
It's surprisingly complicated to handle incremental JDBC reads in Spark. IMHO, it severely limits the ease of building many applications and may not be worth your trouble if Sqoop is doing the job.
However, it is doable. See this thread for an example using the dbtable option:
Apache Spark selects all rows
To keep this job idempotent, you'll need to read in the max row of your prior output, either directly by loading all the data files or via a log file that you write out each time. If your data files are massive you may need to use the log file; if they are smaller, you could potentially load them directly.
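To illustrate the dynamic lowerBound/upperBound part of the question, here is a rough Scala sketch (the connection details, table name and tuning values are assumptions, not working settings): fetch the bounds with a small un-partitioned JDBC read first, then pass them into the partitioned read. An incremental predicate on a check column could be added to the same subquery.
// Assumed connection details, for illustration only
val driver   = "org.postgresql.Driver"
val url      = "jdbc:postgresql://host:5432/db"
val user     = "user"
val password = "password"

// Step 1: compute the bounds of the generated row_nums column
val boundsQuery =
  """(select min(row_nums) as lo, max(row_nums) as hi from
      (select a1.*, row_number() over (order by sales) as row_nums
       from schema.table a1) t) bounds"""

val bounds = spark.read.format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("dbtable", boundsQuery)
  .option("user", user)
  .option("password", password)
  .load()
  .collect()(0)

val lowerBound = bounds.getAs[Number]("lo").longValue
val upperBound = bounds.getAs[Number]("hi").longValue

// Step 2: the main read, partitioned on row_nums.
// Note: row_number() must be stable across the partitioned reads,
// i.e. "order by sales" should define a total order.
val queryNum =
  """(select a1.*, row_number() over (order by sales) as row_nums
      from schema.table a1) q"""

val df = spark.read.format("jdbc")
  .option("driver", driver)
  .option("url", url)
  .option("dbtable", queryNum)
  .option("partitionColumn", "row_nums")
  .option("lowerBound", lowerBound)
  .option("upperBound", upperBound)
  .option("numPartitions", "8")    // tune for your cluster
  .option("fetchsize", "1000")     // tune for your database
  .option("user", user)
  .option("password", password)
  .load()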

Spark reuse broadcast DF

I would like to reuse my DataFrame (without falling back to doing this via a "map" function over an RDD/Dataset), which I mark as broadcastable, but it seems Spark keeps broadcasting it again and again.
Given a table "bank" (a test table), I perform the following:
import org.apache.spark.sql.functions.broadcast
val cachedDf = spark.sql("select * from bank").cache
cachedDf.count
val dfBroadcasted = broadcast(cachedDf)
val dfNormal = spark.sql("select * from bank")
dfNormal.join(dfBroadcasted, List("age"))
.join(dfBroadcasted, List("age")).count
I'm caching beforehand just in case it makes a difference, but it's the same with or without.
If I execute the above code and inspect the SQL plan, I can see that my broadcast DF gets broadcast TWICE, with different timings as well (if I add more actions afterwards, they broadcast it again too).
I care about this, because I actually have a long-running program which has a "big" DataFrame which I can use to filter out HUGE DataFrames, and I would like that "big" DataFrame to be reused.
Is there a way to force reusability? (Not only inside the same action but between actions, though I could live with just the same action.)
Thanks!
OK, updating the question.
Summarising: INSIDE the same action, left_semi joins will reuse broadcasts, while normal/left joins won't. I'm not sure whether this is because Spark already knows the columns of that DF won't affect the output at all, so it can reuse it, or whether it's just an optimization Spark is missing.
My problem seems mostly solved, although it would be great if someone knew how to keep the broadcast across actions.
If I use left_semi (which is the join I'm going to use in my real app), the broadcast is only performed once.
With:
dfNormalxx.join(dfBroadcasted, Seq("age"),"left_semi")
.join(dfBroadcasted, Seq("age"),"left_semi").count
The plan then shows the broadcast being performed only once (I also changed the data size so it matches my real one, but this made no difference).
Also, the total wall time is much better than with the normal joins (I set 1 executor so nothing gets parallelized; I just wanted to check whether the job was really being done twice).
Even though my collect takes 10 seconds, this will speed up the table reads + groupBys, which are taking around 6-7 minutes.

Spark Sql - Running twice

I came across Spark code for an ETL process in which long, complex SQL statements are written; it hits OOM errors, and sometimes a single job takes 4 hours, with multiple executions of the same code.
They have many ETL processes like this, and I have pasted an example query here with long, complex joins, nesting, aggregation, GROUP BY, ORDER BY, etc. By the way, it is still not the full query.
Please look at the query below; they are using it as:
sqlContext.sql(query).write.mode("append").insertInto(hiveTbl)
Is this the right way of utilizing Spark?
SELECT
more than 30 joins with aggregations
I also had the same problem. Spark infers the column types by running the user's query against an empty result set (see the getSchemaQuery function around line 112 of https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala).
Simple queries like (select ... from ...) where 1=0 will be super fast, but queries that contain joins will be as slow as the joins themselves.
Maybe you can use the "customSchema" option (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) to prevent Spark from inferring the types of your result.
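A rough sketch of what that option looks like (the connection, query, column names and types are all made up; whether it fully avoids the inference round trip depends on your Spark version):
// Hypothetical sketch: supply the result schema up front via customSchema
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")                // assumed connection
  .option("dbtable", "(select id, total from some_big_join) q")   // assumed query
  .option("customSchema", "id BIGINT, total DECIMAL(18,2)")
  .option("user", "user")
  .option("password", "password")
  .load()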
Did you try breaking the SQL into multiple fragments, probably one join per fragment?
Take a join and write the output to a Parquet file.
Join that output with another table and dump it to a Parquet file.
Repeat the above.
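A rough Scala sketch of that staged approach (table names, join keys and paths are made up):
import org.apache.spark.sql.functions.sum

// Stage 1: one join, materialised to Parquet
spark.table("orders")
  .join(spark.table("customers"), Seq("customer_id"))
  .write.mode("overwrite").parquet("/tmp/stage1")

// Stage 2: join the previous output with another table, materialise again
spark.read.parquet("/tmp/stage1")
  .join(spark.table("products"), Seq("product_id"))
  .write.mode("overwrite").parquet("/tmp/stage2")

// Final stage: aggregate and append into the target Hive table
spark.read.parquet("/tmp/stage2")
  .groupBy("customer_id")
  .agg(sum("amount").as("total_amount"))
  .write.mode("append").insertInto("hive_output_table")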
Joins in big data are distributed and have a lot of underlying assumptions. Incidentally, for your use case, prestodb seems to be the right fit.

How to change Spark GroupBy/OrderBy comparator to deal with encrypted data

I'm doing university work in which I am trying to make Spark SQL work over encrypted data (with my own algorithms). I implemented some functions that allow comparing two encrypted values in terms of equality and order, and I am using UDF/UDAF functions to execute them.
For example, if I want to execute this query:
SELECT count(SALARY) FROM table1 WHERE age > 20
I convert this one into:
SELECT mycount_udf(SALARY) FROM table1 WHERE myfilter_udf(greater_udf(age,20))
where mycount_udf, myfilter_udf and greater_udf are UDAFs and UDFs implemented to handle my operations over encrypted data.
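For concreteness, a minimal Scala sketch of how such a comparator can be registered and used (the comparison body is only a placeholder for the author's algorithm, and table1, age and the encrypted literal are assumptions):
// Placeholder: a lexicographic compare stands in for the real
// order comparison over ciphertexts
def encryptedGreater(a: String, b: String): Boolean = a > b

// Register it so the rewritten SQL text can call it in a WHERE clause
spark.udf.register("greater_udf", encryptedGreater _)

spark.sql("SELECT * FROM table1 WHERE greater_udf(age, '<encrypted 20>')")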
However, I am facing a problem when I want to execute queries like ORDER BY/GROUP BY. The internals of these operators use equality and ordering operations to execute the query, so to execute such queries correctly over my encrypted values I would have to change the comparators inside ORDER BY/GROUP BY to use my UDF comparators (equality_udf, greater_udf, etc.).
If I encrypt:
x = 5 => encrypted_x = KSKFA92
y = 6 => encrypted_y = A9283NA
Since 5 < 6, greater_udf(encrypted_x, encrypted_y) will return False. So I have to use this comparator inside ORDER BY (SORT) for the query to execute correctly, because Spark doesn't know the values are encrypted, and comparing encrypted_x with encrypted_y using == or a comparator between Spark DataTypes will produce a wrong result.
Is there any way to do this without changing the Spark GROUP BY/ORDER BY source code? It doesn't seem possible to do it with UDF/UDAFs. I am using Java for this work.

How should I configure Spark to correctly prune Hive Metastore partitions?

I'm having issues when applying partition filters to Spark (v2.0.2/2.1.1) DataFrames that read from a Hive (v2.1.0) table with over 30,000 partitions. I would like to know what the recommended approach is and what, if anything, I'm doing incorrectly, as the current behaviour is a source of large performance and reliability issues.
To enable pruning, I am using the following Spark/Hive property:
--conf spark.sql.hive.metastorePartitionPruning=true
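For reference, a minimal sketch of setting the same property programmatically when building the session (the app name is an assumption):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-pruning-example")
  .enableHiveSupport()
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .getOrCreate()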
When running a query in spark-shell I can see the partition fetch take place with an invocation to ThriftHiveMetastore.Iface.get_partitions, but this unexpectedly occurs without any filtering:
val myTable = spark.table("db.table")
val myTableData = myTable
.filter("local_date = '2017-09-01' or local_date = '2017-09-02'")
.cache
// The HMS call invoked is:
// #get_partitions('db', 'table', -1)
If I use a more simplistic filter, partitions are filtered as desired:
val myTableData = myTable
.filter("local_date = '2017-09-01'")
.cache
// The HMS call invoked is:
// #get_partitions_by_filter(
// 'db', 'table',
// 'local_date = "2017-09-01"',
// -1
// )
The filtering also works correctly if I rewrite the filter to use range operators instead of simply checking for equality:
val myTableData = myTable
.filter("local_date >= '2017-09-01' and local_date <= '2017-09-02'")
.cache
// The HMS call invoked is:
// #get_partitions_by_filter(
// 'db', 'table',
// 'local_date >= "2017-09-01" and local_date <= "2017-09-02"',
// -1
// )
In our case, this behaviour is problematic from a performance perspective; call times are in the region of 4 minutes versus 1 second when correctly filtered. Additionally, routinely loading large volumes of Partition objects onto the heap per query ultimately leads to memory issues in the metastore service.
It seems as though there is a bug around the parsing and interpretation of certain types of filter constructs; however, I've not been able to find a relevant issue in the Spark JIRA. Is there a preferred approach or specific Spark version where filters are correctly applied for all filter variants? Or must I use specific forms (e.g. range operators) when constructing filters? If so, is this limitation documented anywhere?
I have not found a preferred way of querying besides rewriting the filter as described in my (OP) question. I did find that Spark has improved support for this, and it looks like my case is addressed in Spark 2.3.0. This is the ticket fixing the problem I found: SPARK-20331
