Does spark load all data from Kudu when a filter is used? - apache-spark

I am new to Spark. Will the following code load all the data from Kudu, or only the filtered data?
val df: DataFrame = spark.read.options(Map(
  "kudu.master" -> kuduMaster,
  "kudu.table" -> s"impala::platform.${table}")).kudu

val outPutDF = df.filter(row => {
  val recordAt: Long = row.getAs("record_at").toString.toLong
  recordAt >= XXX && recordAt < YYY
})

The easiest way to check whether the filter is pushed down for a given connector is the Spark UI.
The scan nodes in Spark expose a metric for the number of records read from the data source (check the Spark UI -> SQL tab after running a query).
Write a query with and without an explicit predicate (on a small dataset) and compare; see the sketch after the list below.
Inferences
1. If the number of records in the scan node is the same with and without the predicate, Spark has read the data completely from the data source and the filtering is done in Spark.
2. If the numbers differ, predicate pushdown has been implemented in the data source connector.
3. Using this experiment you can also figure out which kinds of predicates are pushed down (this depends on the connector implementation).
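For instance, a minimal sketch of such an experiment against a Kudu table (the table name and bounds are hypothetical; the .kudu shorthand is assumed to come from the kudu-spark import, as in the question):
import org.apache.kudu.spark.kudu._   // assumed import backing the .kudu shorthand
import spark.implicits._

val df = spark.read.options(Map(
  "kudu.master" -> kuduMaster,
  "kudu.table" -> "impala::platform.events")).kudu   // hypothetical table name

// 1. Without an explicit predicate: the scan node reads everything.
df.count()

// 2. With a predicate expressed on a column (not a Scala lambda).
df.filter($"record_at" >= 1000L && $"record_at" < 2000L).count()

// Compare the scan node's record-count metric for the two runs in the
// Spark UI -> SQL tab; explain(true) on the filtered query also shows whether
// the predicate ends up inside the scan or in a separate Spark Filter node.
df.filter($"record_at" >= 1000L && $"record_at" < 2000L).explain(true)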

Try .explain.
Not sure I got the code, but here is an example of some code that works (where XYZ is a case class with a someVal field).
val dfX = df.map(row => XYZ(row.getAs[Long]("someVal"))).filter($"someVal" === 2)
But assuming you can get the code to work, Spark "predicate pushdown" will apply in your case and the filtering will be applied in the Kudu storage manager. So, not all data is loaded.
This is from the KUDU Guide:
<> and OR predicates are not pushed to Kudu, and instead will be
evaluated by the Spark task. Only LIKE predicates with a suffix
wildcard are pushed to Kudu, meaning that LIKE "FOO%" is pushed down
but LIKE "FOO%BAR" isn’t.
That is to say, your case is OK.
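For example, here is a sketch of the question's range filter rewritten as a column expression, the form the connector can push down (XXX and YYY are the bounds from the question, assumed to be Long values already in scope):
import spark.implicits._

// record_at is assumed to be stored as a numeric column; comparing it as a
// Column expression (rather than inside a Scala lambda) lets Spark hand the
// range predicate to the Kudu connector.
val outPutDF = df.filter($"record_at" >= XXX && $"record_at" < YYY)

// The predicate should show up under the Kudu scan node, not in a Spark Filter.
outPutDF.explain(true)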

Related

Spark 2.x Dataframe write consistency check in Append Mode

I am reading data in Spark inside a for loop, performing joins, and writing the data to a path in append mode.
import org.apache.spark.sql.SaveMode

for (partition <- partitionlist) {
  val df = spark.read.parquet("path")
  val df2 = df.join(anotherdf, df("col1") === anotherdf("col1"))
  df2.write.mode(SaveMode.Append).partitionBy("partitionColumn").format("parquet").save("anotherpath")
}
In the sample code above we are using Spark 2.x. Since the Spark 2 write APIs are not consistent, is it possible that in some iteration, if the stages/tasks of the write go into retries and only succeed after a few attempts, we end up with duplicate data in the output written by that iteration of the for loop?
EDIT: spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 is being used.
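For reference, a minimal sketch of where that committer setting is typically applied (the app name is hypothetical; it can equally be passed via spark-submit --conf):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("append-loop")  // hypothetical name
  // Algorithm version 2 commits task output directly into the destination at
  // task-commit time rather than job-commit time, which is why retries matter here.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()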

How to specify the filter condition with spark DataFrameReader API for a table?

I was reading about Spark in the Databricks documentation: https://docs.databricks.com/data/tables.html#partition-pruning-1
It says
When the table is scanned, Spark pushes down the filter predicates
involving the partitionBy keys. In that case, Spark avoids reading
data that doesn’t satisfy those predicates. For example, suppose you
have a table that is partitioned by <date>. A query
such as SELECT max(id) FROM <example-data> WHERE date = '2010-10-10'
reads only the data files containing tuples whose date value matches
the one specified in the query.
How can I specify such filter condition in DataFrameReader API while reading a table?
As Spark is lazily evaluated, when you read data using the DataFrame reader it is just added as a stage in the underlying DAG.
When you run a SQL query over the data, that is also added as another stage in the DAG.
Only when you apply an action on the DataFrame is the DAG evaluated; all the stages are optimized by the Catalyst optimizer, which in the end generates the most cost-effective physical plan.
At the time of DAG evaluation, predicate conditions are pushed down and only the required data is read into memory.
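As a sketch of that behaviour (table and column names are hypothetical, mirroring the partitioned-table example quoted from the Databricks docs):
import org.apache.spark.sql.functions.max
import spark.implicits._

// Reading the table only records a scan in the logical plan; nothing is loaded yet.
val events = spark.table("events")                   // assumed partitioned by "date"

// The filter is another lazy transformation on the plan.
val pruned = events.filter($"date" === "2010-10-10")

// Only at action/plan time does Catalyst push the predicate down; look for the
// date condition under PartitionFilters in the scan node of the printed plan.
pruned.agg(max($"id")).explain()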
DataFrameReader is created (available) exclusively using SparkSession.read.
That means it is created when the following code is executed (example of csv file load)
val df = spark.read.csv("path1,path2,path3")
Spark provides a pluggable data provider framework (the Data Source API) to roll out your own data source. Basically, it provides interfaces that can be implemented for reading from/writing to your custom data source. That is generally where partition pruning and predicate pushdown are implemented.
Databricks Spark supports many built-in data sources (along with predicate pushdown and partition pruning capabilities), as per https://docs.databricks.com/data/data-sources/index.html.
So, if the need is to load data from a JDBC table and specify filter conditions, see the following example:
// Note: The parentheses are required.
val pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
val df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
Please refer to more details here
https://docs.databricks.com/data/data-sources/sql-databases.html

What is the best way to join multiple jdbc connection tables in spark?

I'm trying to migrate a query to pyspark and need to join multiple tables in it. All the tables in question are in Redshift and I'm using the jdbc connector to talk to them.
My problem is how to do these joins optimally without reading in too much data (i.e. loading whole tables and joining on a key), and without just blatantly using:
spark.sql("""join table1 on x=y join table2 on y=z""")
Is there a way to push the queries down to Redshift but still use the Spark DataFrame API for writing the logic, and also to use DataFrames from the Spark context without saving them to Redshift just for the joins?
Please find below some points to consider:
The connector will push down the specified filters only if a filter is specified in your Spark code, e.g. select * from tbl where id > 10000. You can confirm that yourself; just check the responsible Scala code. Also, here is the corresponding test which demonstrates exactly that. The test test("buildWhereClause with multiple filters") verifies that the variable expectedWhereClause is equal to the whereClause generated by the connector. The generated WHERE clause should be:
"""
|WHERE "test_bool" = true
|AND "test_string" = \'Unicode是樂趣\'
|AND "test_double" > 1000.0
|AND "test_double" < 1.7976931348623157E308
|AND "test_float" >= 1.0
|AND "test_int" <= 43
|AND "test_int" IS NOT NULL
|AND "test_int" IS NULL
"""
which is derived from the Spark filters specified above.
The driver also supports column filtering, meaning it will load only the required columns by pushing the valid columns down to Redshift. You can again verify that from the corresponding Scala tests test("DefaultSource supports simple column filtering") and test("query with pruned and filtered scans").
In your case, though, you haven't specified any filters in your join query, hence Spark cannot leverage the two previous optimisations. If you are aware of such filters, please feel free to apply them.
Last but not least, as Salim already mentioned, the official Spark connector for Redshift can be found here. The Spark connector is built on top of the Amazon Redshift JDBC driver, therefore it will try to use it anyway, as specified in the connector's code.
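As an illustration of the two optimisations above, a hedged sketch of a spark-redshift read (the format name and option keys follow the connector's documented settings; the table, column, and S3 temp-dir values are placeholders):
import spark.implicits._

val employees = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("dbtable", "employees")
  .option("tempdir", "s3a://my-bucket/tmp/")        // placeholder staging path
  .option("forward_spark_s3_credentials", "true")
  .load()

// Column pruning: only emp_no is unloaded from Redshift.
// Filter pushdown: the comparison becomes part of the generated WHERE clause.
val filtered = employees.select($"emp_no").filter($"emp_no" < 10008)
filtered.show()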

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing Spark DataFrames by adding Ignite on top of them. The following code is how we currently read the DataFrame:
val df = sparksession.read.parquet(path).cache()
I managed to save and load a Spark DataFrame from Ignite following the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. The following code is how I do it now with Ignite:
val df = spark.read
  .format(IgniteDataFrameSettings.FORMAT_IGNITE)              // Data source
  .option(IgniteDataFrameSettings.OPTION_TABLE, "person")     // Table to read.
  .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE, CONFIG) // Ignite config.
  .load()
df.createOrReplaceTempView("person")
SQL queries (like select a, b, c from table where x) on the Ignite DataFrame work, but the performance is much slower than Spark alone (i.e. querying the Spark DataFrame directly, without Ignite). An SQL query often takes 5 to 30 seconds, and it is commonly 2 or 3 times slower than Spark alone. I noticed that a lot of data (100 MB+) is exchanged between the Ignite container and the Spark container for every query. A query with the same "where" clause but a smaller result is processed faster. Overall, Ignite's DataFrame support seems to be a simple wrapper on top of Spark, and hence in most cases it is slower than Spark alone. Is my understanding correct?
Also, following the code example, when the cache is created in Ignite it automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark", so I couldn't change any cache configuration in XML (because I would need to specify the cache name in XML/code to configure it, and Ignite complains that it already exists). Is this expected?
Thanks
First of all, it doesn't seem that your test is fair. In the first case you prefetch the Parquet data, cache it locally in Spark, and only then execute the query. In the case of the Ignite DataFrame you don't use caching, so data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will degrade significantly once some of the data has to be fetched during execution.
However, with Ignite you can use indexing to improve performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time a query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index
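For instance, a sketch of creating such an index through Ignite's SQL API (the cache and index names are assumptions based on the auto-generated "SQL_PUBLIC_..." naming mentioned in the question; CONFIG is the same Ignite configuration file used there):
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery

// Connect with the same configuration the Spark side uses, then issue the DDL
// against the SQL table backing the "person" cache created from Spark.
val ignite = Ignition.start(CONFIG)
val cache = ignite.cache[Any, Any]("SQL_PUBLIC_PERSON")   // assumed cache name
cache.query(new SqlFieldsQuery(
  "CREATE INDEX IF NOT EXISTS person_x_idx ON person (x)")).getAll()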

How to list partition-pruned inputs for Hive tables?

I am using Spark SQL to query data in Hive. The data is partitioned and Spark SQL correctly prunes the partitions when querying.
However, I need to list either the source tables along with partition filters or the specific input files (.inputFiles would be an obvious choice for this but it does not reflect pruning) for a given query in order to determine on which part of the data the computation will be taking place.
The closest I was able to get was by calling df.queryExecution.executedPlan.collectLeaves(). This contains the relevant plan nodes as HiveTableScanExec instances. However, this class is private[hive] for the org.apache.spark.sql.hive package. I think the relevant fields are relation and partitionPruningPred.
Is there any way to achieve this?
Update: I was able to get the relevant information thanks to Jacek's suggestion and by using getHiveQlPartitions on the returned relation and providing partitionPruningPred as the parameter:
scan.findHiveTables(execPlan).flatMap(e => e.relation.getHiveQlPartitions(e.partitionPruningPred))
This contained all the data I needed, including the paths to all input files, properly partition pruned.
Well, you're asking for low-level details of the query execution and things are bumpy down there. You've been warned :)
As you noted in your comment, all the execution information is in this private[hive] HiveTableScanExec.
One way to get some insight into the HiveTableScanExec physical operator (that is, a Hive table at execution time) is to create a sort of backdoor in the org.apache.spark.sql.hive package that is not private[hive].
package org.apache.spark.sql.hive

import org.apache.spark.sql.hive.execution.HiveTableScanExec

object scan {
  def findHiveTables(execPlan: org.apache.spark.sql.execution.SparkPlan) =
    execPlan.collect { case hiveTables: HiveTableScanExec => hiveTables }
}
Change the code to meet your needs.
With the scan.findHiveTables, I usually use :paste -raw while in spark-shell to sneak into such "uncharted areas".
You could then simply do the following:
scala> spark.version
res0: String = 2.4.0-SNAPSHOT
// Create a Hive table
import org.apache.spark.sql.types.StructType
spark.catalog.createTable(
tableName = "h1",
source = "hive", // <-- that makes for a Hive table
schema = new StructType().add($"id".long),
options = Map.empty[String, String])
// select * from h1
val q = spark.table("h1")
val execPlan = q.queryExecution.executedPlan
scala> println(execPlan.numberedTreeString)
00 HiveTableScan [id#22L], HiveTableRelation `default`.`h1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#22L]
// Use the above code and :paste -raw in spark-shell
import org.apache.spark.sql.hive.scan
scala> scan.findHiveTables(execPlan).size
res11: Int = 1
The relation field is the Hive table after it has been resolved using the ResolveRelations and FindDataSourceTable logical rules that the Spark analyzer uses to resolve data sources and Hive tables.
You can get pretty much all the information Spark uses from a Hive metastore through the ExternalCatalog interface, available as spark.sharedState.externalCatalog. That gives you all the metadata Spark uses to plan queries over Hive tables.
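For example, a minimal sketch against Spark 2.x (database and table names mirror the h1 example above):
// The ExternalCatalog talks to the metastore directly, so this lists a table's
// partitions (and their storage locations) without going through a query plan.
val catalog = spark.sharedState.externalCatalog
val partitions = catalog.listPartitions("default", "h1")
partitions.foreach(p => println(p.spec -> p.storage.locationUri))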
