Why does Apache Spark read unnecessary Parquet columns within nested structures?

My team is building an ETL process to load raw delimited text files into a Parquet based "data lake" using Spark. One of the promises of the Parquet column store is that a query will only read the necessary "column stripes".
But we're seeing unexpected columns being read for nested schema structures.
To demonstrate, here is a POC using Scala and the Spark 2.0.1 shell:
// Preliminary setup
sc.setLogLevel("INFO")
import org.apache.spark.sql.types._
import org.apache.spark.sql._

// Create a schema with nested complex structures
val schema = StructType(Seq(
  StructField("F1", IntegerType),
  StructField("F2", IntegerType),
  StructField("Orig", StructType(Seq(
    StructField("F1", StringType),
    StructField("F2", StringType))))))

// Create some sample data
val data = spark.createDataFrame(
  sc.parallelize(Seq(
    Row(1, 2, Row("1", "2")),
    Row(3, null, Row("3", "ABC")))),
  schema)

// Save it
data.write.mode(SaveMode.Overwrite).parquet("data.parquet")
Then we read the file back into a DataFrame and project to a subset of columns:
// Read it back into another DataFrame
val df = spark.read.parquet("data.parquet")
// Select & show a subset of the columns
df.select($"F1", $"Orig.F1").show
When this runs we see the expected output:
+---+-------+
| F1|Orig_F1|
+---+-------+
| 1| 1|
| 3| 3|
+---+-------+
But... the query plan shows a slightly different story:
The "optimized plan" shows:
val projected = df.select($"F1", $"Orig.F1".as("Orig_F1"))
projected.queryExecution.optimizedPlan
// Project [F1#18, Orig#20.F1 AS Orig_F1#116]
// +- Relation[F1#18,F2#19,Orig#20] parquet
And "explain" shows:
projected.explain
// == Physical Plan ==
// *Project [F1#18, Orig#20.F1 AS Orig_F1#116]
// +- *Scan parquet [F1#18,Orig#20] Format: ParquetFormat, InputPaths: hdfs://sandbox.hortonworks.com:8020/user/stephenp/data.parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<F1:int,Orig:struct<F1:string,F2:string>>
And the INFO logs produced during execution also confirm that the Orig.F2 column is unexpectedly read:
16/10/21 15:13:15 INFO parquet.ParquetReadSupport: Going to read the following fields from the Parquet file:
Parquet form:
message spark_schema {
  optional int32 F1;
  optional group Orig {
    optional binary F1 (UTF8);
    optional binary F2 (UTF8);
  }
}
Catalyst form:
StructType(StructField(F1,IntegerType,true), StructField(Orig,StructType(StructField(F1,StringType,true), StructField(F2,StringType,true)),true))
According to the Dremel paper and the Parquet documentation, columns for complex nested structures should be independently stored and independently retrievable.
Questions:
Is this behavior a limitation of the current Spark query engine? In other words, does Parquet support optimally executing this query, but Spark's query planner is naive?
Or, is this a limitation of the current Parquet implementation?
Or, am I not using the Spark APIs correctly?
Or, am I misunderstanding how Dremel/Parquet column storage is supposed to work?
Possibly related: Why does the query performance differ with nested columns in Spark SQL?

It's a limitation of the Spark query engine at the moment; the relevant JIRA ticket is below. Spark only handles pushdown for simple types in Parquet, not nested StructTypes:
https://issues.apache.org/jira/browse/SPARK-17636

The issue has been fixed since Spark 2.4.0. This applies to structs as well as arrays of structs.
Before Spark 3.0.0:
Set spark.sql.optimizer.nestedSchemaPruning.enabled to true (a concrete sketch is shown after the links below)
See related Jira here: https://issues.apache.org/jira/browse/SPARK-4502
After Spark 3.0.0:
spark.sql.optimizer.nestedSchemaPruning.enabled now defaults to true
Related Jira here: https://issues.apache.org/jira/browse/SPARK-29805
Also related SO question: Efficient reading nested parquet column in Spark
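To make the pre-3.0 case concrete, here is a minimal sketch (reusing the data.parquet file written in the question; the configuration key comes from the JIRAs above) of enabling the rule and checking its effect:
// Enable nested schema pruning (off by default before Spark 3.0.0)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

val df = spark.read.parquet("data.parquet")
val projected = df.select($"F1", $"Orig.F1".as("Orig_F1"))

// With pruning enabled, the scan's ReadSchema should list only Orig.F1
// instead of the whole Orig struct
projected.explain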

Related

How to get the partition info of a Hive table in Spark

I want to execute SQL with Spark like this:
sparkSession.sql("select * from table")
But I want to check the table's partitioning before execution to avoid a full scan.
If the table is partitioned, my program will force users to add a partition filter. If it isn't, it's OK to run as-is.
So my question is: how do I know whether a table is partitioned?
My thought is to read this info from the metastore, but how to access the metastore is another problem I've run into. Could someone help?
Assuming that your real goal is to restrict execution of unbounded queries, I think it would be easier to get the query's execution plan and look at its FileScan / HiveTableScan leaf nodes to see whether any partition filters are being applied. For partitioned tables, the number of partitions the query is actually going to scan will also be shown. So, something like this should do:
scala> val df_unbound = spark.sql("select * from hottab")
df_unbound: org.apache.spark.sql.DataFrame = [id: int, descr: string ... 1 more field]
scala> val plan1 = df_unbound.queryExecution.executedPlan.toString
plan1: String =
"*(1) FileScan parquet default.hottab[id#0,descr#1,loaddate#2] Batched: true, Format: Parquet,
Location: CatalogFileIndex[hdfs://ns1/user/hive/warehouse/hottab],
PartitionCount: 365, PartitionFilters: [],
PushedFilters: [], ReadSchema: struct<id:int,descr:string>
"
scala> val df_filtered = spark.sql("select * from hottab where loaddate='2019-07-31'")
df_filtered: org.apache.spark.sql.DataFrame = [id: int, descr: string ... 1 more field]
scala> val plan2 = df_filtered.queryExecution.executedPlan.toString
plan2: String =
"*(1) FileScan parquet default.hottab[id#17,descr#18,loaddate#19] Batched: true, Format: Parquet,
Location: PrunedInMemoryFileIndex[hdfs://ns1/user/hive/warehouse/hottab/loaddate=2019-07-31],
PartitionCount: 1, PartitionFilters: [isnotnull(loaddate#19), (loaddate#19 = 2019-07-31)],
PushedFilters: [], ReadSchema: struct<id:int,descr:string>
"
This way, you also don't have to deal with SQL parsing to find the table name(s) in queries, or with interrogating the metastore yourself.
As a bonus, you'll also be able to see whether "regular" filter pushdown occurs (for storage formats that support it) in addition to partition pruning.
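If you want to automate the check, a rough sketch based on the same plan-string heuristic (an empty PartitionFilters list in the scan node, as in the plans above) could look like this; treat it as a heuristic, not an exact API:
import org.apache.spark.sql.DataFrame

// Heuristic: flag queries whose scan nodes report no partition filters at all
def missingPartitionFilter(df: DataFrame): Boolean =
  df.queryExecution.executedPlan.toString.contains("PartitionFilters: []")

val q = spark.sql("select * from hottab")
if (missingPartitionFilter(q))
  println("Warning: no partition filter applied; this query may scan the whole table")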
You can use Scala's Try and execute SHOW PARTITIONS on the required table:
import scala.util.{Try, Success, Failure}

val numPartitions = Try(spark.sql("show partitions database.table").count) match {
  case Success(v) => v
  case Failure(e) => -1L
}
Later you can check numPartitions. If the value is -1 then the table is not partitioned (SHOW PARTITIONS fails on non-partitioned tables, so the Try returns a Failure).
import org.apache.spark.sql.catalyst.TableIdentifier

val listPartitions = spark.sessionState.catalog.listPartitionNames(TableIdentifier("table_name", Some("db name")))
listPartitions: Seq[String] = ArrayBuffer(partition1=value1, ... ) // partitioned table
listPartitions: Seq[String] = ArrayBuffer() // non-partitioned table
I know this is late, but this might help someone
spark.sql("describe detail database.table").select("partitionColumns").show(false)
This gives a row with the partition columns in an array.

Filtering Dataframe with predicate pushdown from another dataframe

How can I push down a filter to a DataFrame read based on another DataFrame I have? Basically, I want to avoid reading the second dataframe entirely and then doing an inner join. Instead, I would like to apply a filter when reading so it is filtered at the source. Even if I use an inner join wrapped around the read, the plan doesn't show that it is getting filtered. I feel like there is definitely a better way to set this up. Using Spark 2.x, I have this so far, but I want to avoid collecting a List as below:
// Don't want to do this collect...too slow
val idFilter = df1.select("id").distinct().map(r => r.getLong(0)).collect.toList
val df2: DataFrame = spark.read.format("parquet").load("<path>")
.filter($"id".isin(idFilter: _*))
You cannot directly use predicate pushdown unless you are implementing a DataSource yourself. Predicate pushdown is a mechanism provided by Spark data sources and must be implemented by each data source individually.
For file-based data sources there is already a simple mechanism in place based on partitioning on disk.
Consider the following DataFrame:
val df = Seq(("test", "day1"), ("test2", "day2")).toDF("data", "day")
If we save that DataFrame to disk the following way:
df.write.partitionBy("day").save("/tmp/data")
The result will be the following folder structure:
/tmp/data
|-- day=day1
|   |-- part1....parquet
|   |-- part2....parquet
|-- day=day2
|   |-- part1....parquet
|   |-- part2....parquet
If you now use this data source like this:
spark.read.load("/tmp/data").filter($"day" === "day1").show()
Spark doesn't even bother loading the data in folder day=day2, as there is no need for it.
This is one type of predicate pushdown that works for every standard file format Spark supports.
A more specific mechanism is Parquet. Parquet is a columnar file format, which means it is quite easy to skip columns. If you have Parquet-based files with three columns a, b, c in a file /tmp/myparquet.parquet, the following query:
spark.read.parquet("/tmp/myparquet.parquet").select("a").show()
will result in internal column pruning (projection pushdown), where Spark only fetches data for column a without reading data for columns b or c.
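A quick way to confirm this on the hypothetical file above is to look at the ReadSchema of the FileScan node in the physical plan:
// The FileScan's ReadSchema should mention only column a,
// showing that b and c are never read from the Parquet file
spark.read.parquet("/tmp/myparquet.parquet").select("a").explain()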
If someone is interested, these mechanisms are established by implementing this trait:
/**
 * A BaseRelation that can eliminate unneeded columns and filter using selected
 * predicates before producing an RDD containing all matching tuples as Row objects.
 *
 * The actual filter should be the conjunction of all `filters`,
 * i.e. they should be "and" together.
 *
 * The pushed down filters are currently purely an optimization as they will all be evaluated
 * again. This means it is safe to use them with methods that produce false positives such
 * as filtering partitions based on a bloom filter.
 *
 * @since 1.3.0
 */
@Stable
trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}
to be found in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
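For illustration, here is a minimal, self-contained sketch of a relation implementing that trait (the class name, schema, and in-memory data are hypothetical; a real source would also provide a RelationProvider so it can be wired up via spark.read.format):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, GreaterThan, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical in-memory relation that honors column pruning and one kind of filter
class DemoRelation(override val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {
  private val rows = Seq(Row(1, "a"), Row(2, "b"), Row(3, "c"))

  override def schema: StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Apply the filters we understand; Spark re-evaluates all filters afterwards anyway
    val kept = rows.filter { row =>
      filters.forall {
        case GreaterThan("id", v: Int) => row.getInt(0) > v
        case _                         => true // ignore filters we don't handle
      }
    }
    // Return only the requested columns, in the requested order
    val indices = requiredColumns.map(schema.fieldIndex)
    sqlContext.sparkContext.parallelize(kept.map(r => Row.fromSeq(indices.map(r.get))))
  }
}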

How to use schema from one of the columns of a dataset to parse another column and create a flattened dataset using Spark Streaming 2.2.0?

I have the following source data frame that I create from reading messages from Kafka
col1: string
col2: json string
col1 | col2
---------------------------------------------------------------------------
schemaUri1 | "{"name": "foo", "zipcode": 11111}"
schemaUri2 | "{"name": "bar", "zipcode": 11112, "id": 1234}"
schemaUri1 | "{"name": "foobar", "zipcode": 11113}"
schemaUri2 | "{"name": "barfoo", "zipcode": 11114, "id": 1235, "interest": "reading"}"
My target data frame
name | zipcode | id | interest
--------------------------------
foo | 11111 | null | null
bar | 11112 | 1234 | null
foobar | 11113 | null | null
barfoo | 11114 | 1235 | reading
Assume you have the following function
// This function returns a StructType that represents a schema for a given schemaUri
public StructType getSchema(String schemaUri)
The schema column does not matter for this problem (and cannot be used with the Spark API anyway). All that is relevant is the columns you want to extract:
import org.apache.spark.sql.functions.get_json_object

val names = Seq("name", "zipcode", "id", "interest")
df.select(names.map(s => get_json_object($"col2", s"$$.${s}") as s): _*)
or:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val superSchema = StructType(Seq(
StructField("name", StringType),
StructField("zipcode", IntegerType),
StructField("id", LongType),
StructField("interest", StringType)
))
df.select(from_json($"col2", superSchema).alias("_")).select($"_.*")
This is a great example of an ill-defined question. In the holiday spirit, let's ignore the lack of an attempt and focus on the actual problems:
Structured Streaming is, well... structured. That means it requires a well-defined schema; this is why it, for example, disables schema inference.
A schema provided as a field reference is useless:
It cannot be used with the existing API (for example, from_json can only use a literal schema, not one read from a column).
Even if it could be used, it would not be possible to propagate this information back to the planner.
Finally, it is redundant: JSON itself is self-describing and doesn't require a schema for parsing. The reason Spark functions need this information is that the planner requires it to compute the execution plan before the query is started.
Even if you could parse the data, another problem is introduced by your comment:
I might have problem because I dont know the schema ahead of the time
If you don't know the schema, then there is very little you can do with the resulting dataset. At the end of the day, you're no better off with the parsed data than you'd be with the raw JSON BLOB.
Now comes the real question: what exactly is the problem you are trying to solve? This is once again missing from the question, but we can suspect one of two scenarios:
You have a stream of unrelated data (unlikely). A possible solution here is to write the data out to separate Kafka topics to demultiplex it:
stream.select($"col1" as "topic", $"col2" as "value").writeStream
.format("kafka")
.option("kafka.bootstrap.servers", ...)
.start()
and then create a separate input stream for each topic, each with an already-known schema.
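A minimal sketch of the read side for one such topic (the topic name, bootstrap servers, and schema here are hypothetical placeholders):
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Hypothetical schema for the records published to the "schemaUri1" topic
val schemaUri1Schema = StructType(Seq(
  StructField("name", StringType),
  StructField("zipcode", IntegerType)))

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")  // placeholder
  .option("subscribe", "schemaUri1")               // one topic per schema
  .load()
  .select(from_json($"value".cast("string"), schemaUri1Schema).alias("data"))
  .select($"data.*")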
Schema evolution. In that case, define an API for retrieving the latest known schema.
If all variants are compatible, use it to parse the data as already shown in this thread.
Otherwise, redefine getSchema to return a transformation function to the latest known schema.
Keep the schema constant across the lifetime of the query; if you want to upgrade, dispose of the old query and create a new one.

Why does Complete output mode require aggregation?

I work with the latest Structured Streaming in Apache Spark 2.2 and got the following exception:
org.apache.spark.sql.AnalysisException: Complete output mode not
supported when there are no streaming aggregations on streaming
DataFrames/Datasets;;
Why does Complete output mode require a streaming aggregation? What would happen if Spark allowed Complete output mode with no aggregations in a streaming query?
scala> spark.version
res0: String = 2.2.0
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.SQLContext
implicit val sqlContext: SQLContext = spark.sqlContext
val source = MemoryStream[(Int, Int)]
val ids = source.toDS.toDF("time", "id").
withColumn("time", $"time" cast "timestamp"). // <-- convert time column from Int to Timestamp
dropDuplicates("id").
withColumn("time", $"time" cast "long") // <-- convert time column back from Timestamp to Int
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
scala> val q = ids.
| writeStream.
| format("memory").
| queryName("dups").
| outputMode(OutputMode.Complete). // <-- memory sink supports checkpointing for Complete output mode only
| trigger(Trigger.ProcessingTime(30.seconds)).
| option("checkpointLocation", "checkpoint-dir"). // <-- use checkpointing to save state between restarts
| start
org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;;
Project [cast(time#10 as bigint) AS time#15L, id#6]
+- Deduplicate [id#6], true
   +- Project [cast(time#5 as timestamp) AS time#10, id#6]
      +- Project [_1#2 AS time#5, _2#3 AS id#6]
         +- StreamingExecutionRelation MemoryStream[_1#2,_2#3], [_1#2, _2#3]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:115)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:247)
... 57 elided
From the Structured Streaming Programming Guide - other queries (excluding aggregations, mapGroupsWithState and flatMapGroupsWithState):
Complete mode not supported as it is infeasible to keep all unaggregated data in the Result Table.
To answer the question:
What would happen if Spark allowed Complete output mode with no aggregations in a streaming query?
Probably OOM.
The puzzling part is why dropDuplicates("id") is not treated as a streaming aggregation.
I think the problem is the output mode. Instead of using OutputMode.Complete, use OutputMode.Append as shown below.
scala> val q = ids
.writeStream
.format("memory")
.queryName("dups")
.outputMode(OutputMode.Append)
.trigger(Trigger.ProcessingTime(30.seconds))
.option("checkpointLocation", "checkpoint-dir")
.start

How to read a nested collection in Spark

I have a parquet table with one of the columns being of type
array<struct<col1,col2,..colN>>
I can run queries against this table in Hive using the LATERAL VIEW syntax.
How do I read this table into an RDD, and more importantly, how do I filter, map, etc. over this nested collection in Spark?
I could not find any references to this in the Spark documentation. Thanks in advance for any information!
P.S. I felt it might be helpful to give some stats on the table.
Number of columns in main table ~600. Number of rows ~200m.
Number of "columns" in nested collection ~10. Avg number of records in nested collection ~35.
There is no magic in the case of a nested collection. Spark will handle an RDD[(String, String)] and an RDD[(String, Seq[String])] the same way.
Reading such a nested collection from Parquet files can be tricky, though.
Let's take an example from the spark-shell (1.3.1):
scala> import sqlContext.implicits._
import sqlContext.implicits._
scala> case class Inner(a: String, b: String)
defined class Inner
scala> case class Outer(key: String, inners: Seq[Inner])
defined class Outer
Write the parquet file:
scala> val outers = sc.parallelize(List(Outer("k1", List(Inner("a", "b")))))
outers: org.apache.spark.rdd.RDD[Outer] = ParallelCollectionRDD[0] at parallelize at <console>:25
scala> outers.toDF.saveAsParquetFile("outers.parquet")
Read the parquet file:
scala> import org.apache.spark.sql.catalyst.expressions.Row
import org.apache.spark.sql.catalyst.expressions.Row
scala> val dataFrame = sqlContext.parquetFile("outers.parquet")
dataFrame: org.apache.spark.sql.DataFrame = [key: string, inners: array<struct<a:string,b:string>>]
scala> val outers = dataFrame.map { row =>
| val key = row.getString(0)
| val inners = row.getAs[Seq[Row]](1).map(r => Inner(r.getString(0), r.getString(1)))
| Outer(key, inners)
| }
outers: org.apache.spark.rdd.RDD[Outer] = MapPartitionsRDD[8] at map at DataFrame.scala:848
The important part is row.getAs[Seq[Row]](1). The internal representation of a nested sequence of structs is ArrayBuffer[Row]; you could use any super-type of it instead of Seq[Row]. The 1 is the column index in the outer row. I used the method getAs here, but there are alternatives in the latest versions of Spark. See the source code of the Row trait.
Now that you have an RDD[Outer], you can apply any wanted transformation or action.
// Filter the outers
outers.filter(_.inners.nonEmpty)
// Filter the inners
outers.map(outer => outer.copy(inners = outer.inners.filter(_.a == "a")))
Note that we used the spark-SQL library only to read the parquet file. You could for example select only the wanted columns directly on the DataFrame, before mapping it to a RDD.
dataFrame.select('col1, 'col2).map { row => ... }
I'll give a Python-based answer since that's what I'm using. I think Scala has something similar.
The explode function was added in Spark 1.4.0 to handle nested arrays in DataFrames, according to the Python API docs.
Create a test dataframe:
from pyspark.sql import Row
df = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3]), Row(a=2, intlist=[4,5,6])])
df.show()
## +-+--------------------+
## |a| intlist|
## +-+--------------------+
## |1|ArrayBuffer(1, 2, 3)|
## |2|ArrayBuffer(4, 5, 6)|
## +-+--------------------+
Use explode to flatten the list column:
from pyspark.sql.functions import explode
df.select(df.a, explode(df.intlist)).show()
## +-+---+
## |a|_c0|
## +-+---+
## |1| 1|
## |1| 2|
## |1| 3|
## |2| 4|
## |2| 5|
## |2| 6|
## +-+---+
Another approach would be using pattern matching like this:
val rdd: RDD[(String, List[(String, String)])] = dataFrame.map(_.toSeq.toList match {
  case List(key: String, inners: Seq[Row]) => key -> inners.map(_.toSeq.toList match {
    case List(a: String, b: String) => (a, b)
  }).toList
})
You can pattern match directly on Row but it is likely to fail for a few reasons.
The above answers are all great and tackle this question from different sides; Spark SQL is also a quite useful way to access nested data.
Here's an example of how to use explode() directly in SQL to query a nested collection.
SELECT hholdid, tsp.person_seq_no
FROM ( SELECT hholdid, explode(tsp_ids) as tsp
FROM disc_mrt.unified_fact uf
)
tsp_ids is a nested array of structs, which has many attributes, including person_seq_no, which I'm selecting in the outer query above.
The above was tested in Spark 2.0. I did a small test and it doesn't work in Spark 1.6. This question was asked when Spark 2 wasn't around, so this answer adds nicely to the list of available options for dealing with nested structures.
Also have a look at the following JIRA for a Hive-compatible way to query nested data using the LATERAL VIEW OUTER syntax, since Spark 2.2 also supports OUTER explode (e.g. when a nested collection is empty but you still want attributes from the parent record):
SPARK-13721: Add support for LATERAL VIEW OUTER explode()
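For reference, a sketch of the OUTER variant against the same (hypothetical) table as above, which keeps parent rows even when tsp_ids is empty:
spark.sql("""
  SELECT hholdid, tsp.person_seq_no
  FROM disc_mrt.unified_fact uf
  LATERAL VIEW OUTER explode(tsp_ids) t AS tsp
""").show()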
Notable unresolved JIRA on explode() for SQL access:
SPARK-7549: Support aggregating over nested fields
