Why does Complete output mode require aggregation?

I work with the latest Structured Streaming in Apache Spark 2.2 and got the following exception:
org.apache.spark.sql.AnalysisException: Complete output mode not
supported when there are no streaming aggregations on streaming
DataFrames/Datasets;;
Why does Complete output mode require a streaming aggregation? What would happen if Spark allowed Complete output mode with no aggregations in a streaming query?
scala> spark.version
res0: String = 2.2.0
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.SQLContext
implicit val sqlContext: SQLContext = spark.sqlContext
val source = MemoryStream[(Int, Int)]
val ids = source.toDS.toDF("time", "id").
withColumn("time", $"time" cast "timestamp"). // <-- convert time column from Int to Timestamp
dropDuplicates("id").
withColumn("time", $"time" cast "long") // <-- convert time column from Timestamp to Long
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._
scala> val q = ids.
| writeStream.
| format("memory").
| queryName("dups").
| outputMode(OutputMode.Complete). // <-- memory sink supports checkpointing for Complete output mode only
| trigger(Trigger.ProcessingTime(30.seconds)).
| option("checkpointLocation", "checkpoint-dir"). // <-- use checkpointing to save state between restarts
| start
org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;;
Project [cast(time#10 as bigint) AS time#15L, id#6]
+- Deduplicate [id#6], true
+- Project [cast(time#5 as timestamp) AS time#10, id#6]
+- Project [_1#2 AS time#5, _2#3 AS id#6]
+- StreamingExecutionRelation MemoryStream[_1#2,_2#3], [_1#2, _2#3]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:115)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:247)
... 57 elided

From the Structured Streaming Programming Guide - other queries (excluding aggregations, mapGroupsWithState and flatMapGroupsWithState):
Complete mode not supported as it is infeasible to keep all unaggregated data in the Result Table.
To answer the question:
What would happen if Spark allowed Complete output mode with no aggregations in a streaming query?
Probably OOM.
The puzzling part is why dropDuplicates("id") is not marked as aggregation.
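For contrast, here is a minimal sketch (not from the question) of a query that Complete mode does accept. Once the stream is aggregated, the Result Table holds one row per group instead of every input row, so its size is bounded by the number of distinct keys. The query name id_counts and the sample rows are made up for illustration, and spark is assumed to be the spark-shell session.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.OutputMode
import spark.implicits._

implicit val sqlContext: SQLContext = spark.sqlContext

// Same (time, id) shape as in the question, but aggregated per id
val countSource = MemoryStream[(Int, Int)]
val counts = countSource.toDS.toDF("time", "id").groupBy("id").count()

val countQuery = counts.writeStream
  .format("memory")
  .queryName("id_counts")            // hypothetical table name
  .outputMode(OutputMode.Complete)   // accepted because the plan contains an aggregation
  .start()

countSource.addData((1, 1), (2, 1), (3, 2))
countQuery.processAllAvailable()     // block until the data added so far is processed

// One row per distinct id, no matter how many input rows arrived
val idCounts = spark.table("id_counts").collect()
countQuery.stop()
```

The state kept here still grows with the number of distinct ids, so this is only bounded relative to keeping all unaggregated rows.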

I think the problem is the output mode. Instead of using OutputMode.Complete, use OutputMode.Append as shown below.
scala> val q = ids
.writeStream
.format("memory")
.queryName("dups")
.outputMode(OutputMode.Append)
.trigger(Trigger.ProcessingTime(30.seconds))
.option("checkpointLocation", "checkpoint-dir")
.start
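To check that Append mode behaves as expected with dropDuplicates, a small sketch along these lines can be run in spark-shell. It is an illustration under assumptions, not the asker's exact setup: the query name dups_append and the sample rows are invented, and checkpointing is left out for brevity.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.streaming.OutputMode
import spark.implicits._

implicit val sqlContext: SQLContext = spark.sqlContext

val dedupSource = MemoryStream[(Int, Int)]
val deduped = dedupSource.toDS.toDF("time", "id").dropDuplicates("id")

val dedupQuery = deduped.writeStream
  .format("memory")
  .queryName("dups_append")          // hypothetical table name
  .outputMode(OutputMode.Append)     // no aggregation needed in Append mode
  .start()

// First batch: id 1 arrives twice, id 2 once
dedupSource.addData((1, 1), (2, 1), (3, 2))
dedupQuery.processAllAvailable()

// Second batch: id 2 was already seen, id 3 is new
dedupSource.addData((4, 2), (5, 3))
dedupQuery.processAllAvailable()

// Each id should appear exactly once across both batches
val dedupedIds = spark.table("dups_append").select("id").as[Int].collect().sorted
dedupQuery.stop()
```

Note that without a watermark the deduplication state keeps every id it has seen, so it grows for as long as the query runs.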

Related

How to write a table to hive from spark without using the warehouse connector in HDP 3.1

When trying to use Spark 2.3 on HDP 3.1 to write to a Hive table directly into Hive's schema, without the warehouse connector, using:
spark-shell --driver-memory 16g --master local[3] --conf spark.hadoop.metastore.catalog.default=hive
val df = Seq(1,2,3,4).toDF
spark.sql("create database foo")
df.write.saveAsTable("foo.my_table_01")
fails with:
Table foo.my_table_01 failed strict managed table checks due to the following reason: Table is marked as a managed table but is not transactional
However, the following partitioned ORC write succeeds:
val df = Seq(1,2,3,4).toDF.withColumn("part", col("value"))
df.write.partitionBy("part").option("compression", "zlib").mode(SaveMode.Overwrite).format("orc").saveAsTable("foo.my_table_02")
Reading it back in Spark with spark.sql("select * from foo.my_table_02").show works just fine.
Now going to Hive / beeline:
0: jdbc:hive2://hostname:2181/> select * from my_table_02;
Error: java.io.IOException: java.lang.IllegalArgumentException: bucketId out of range: -1 (state=,code=0)
Running
describe extended my_table_02;
returns
+-----------------------------+----------------------------------------------------+----------+
| col_name | data_type | comment |
+-----------------------------+----------------------------------------------------+----------+
| value | int | |
| part | int | |
| | NULL | NULL |
| # Partition Information | NULL | NULL |
| # col_name | data_type | comment |
| part | int | |
| | NULL | NULL |
| Detailed Table Information | Table(tableName:my_table_02, dbName:foo, owner:hive/bd-sandbox.t-mobile.at#SANDBOX.MAGENTA.COM, createTime:1571201905, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:value, type:int, comment:null), FieldSchema(name:part, type:int, comment:null)], location:hdfs://bd-sandbox.t-mobile.at:8020/warehouse/tablespace/external/hive/foo.db/my_table_02, inputFormat:org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.orc.OrcSerde, parameters:{path=hdfs://bd-sandbox.t-mobile.at:8020/warehouse/tablespace/external/hive/foo.db/my_table_02, compression=zlib, serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[FieldSchema(name:part, type:int, comment:null)], parameters:{numRows=0, rawDataSize=0, spark.sql.sources.schema.partCol.0=part, transient_lastDdlTime=1571201906, bucketing_version=2, spark.sql.create.version=2.3.2.3.1.0.0-78, totalSize=740, spark.sql.sources.schema.numPartCols=1, spark.sql.sources.schema.part.0={\"type\":\"struct\",\"fields\":[{\"name\":\"value\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"part\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}, numFiles=4, numPartitions=4, spark.sql.partitionProvider=catalog, spark.sql.sources.schema.numParts=1, spark.sql.sources.provider=orc, transactional=true}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, rewriteEnabled:false, catName:hive, ownerType:USER, writeId:-1) |
How can I use spark to write to hive without using the warehouse connector but still writing to the same metastore which can later on be read by hive?
To the best of my knowledge, external tables should be possible (they are not managed, not ACID, not transactional), but I am not sure how to tell saveAsTable to create one.
edit
related issues:
https://community.cloudera.com/t5/Support-Questions/In-hdp-3-0-can-t-create-hive-table-in-spark-failed/td-p/202647
Table loaded through Spark not accessible in Hive
Setting the properties proposed in the answer there does not solve my issue.
This also seems to be a bug: https://issues.apache.org/jira/browse/HIVE-20593
A workaround might be something like https://github.com/qubole/spark-acid or the Hive Warehouse Connector (https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.4/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html), but I do not like the idea of using more duct tape where I have not yet seen any large-scale performance tests. Also, this means changing all existing Spark jobs.
In fact, the question "Cant save table to hive metastore, HDP 3.0" reports issues with large data frames and the warehouse connector.
edit
I just found https://community.cloudera.com/t5/Support-Questions/Spark-hive-warehouse-connector-not-loading-data-when-using/td-p/243613
And:
execute() vs executeQuery()
ExecuteQuery() will always use the Hiveserver2-interactive/LLAP as it
uses the fast ARROW protocol. Using it when the jdbc URL point to the
non-LLAP Hiveserver2 will yield an error.
Execute() uses JDBC and does not have this dependency on LLAP, but has
a built-in restriction to only return 1.000 records max. But for most
queries (INSERT INTO ... SELECT, count, sum, average) that is not a
problem.
But doesn't this kill any high-performance interoperability between hive and spark? Especially if there are not enough LLAP nodes available for large scale ETL.
In fact, this is true. This setting can be configured at https://github.com/hortonworks-spark/spark-llap/blob/26d164e62b45cfa1420d5d43cdef13d1d29bb877/src/main/java/com/hortonworks/spark/sql/hive/llap/HWConf.java#L39, though I am not sure of the performance impact of increasing this value
Did you try the following?
data.write \
.mode("append") \
.insertInto("tableName")
Inside Ambari simply disabling the option of creating transactional tables by default solves my problem.
set to false twice (tez, llap)
hive.strict.managed.tables = false
and enable manually in each table property if desired (to use a transactional table).
Creating an external table (as a workaround) seems to be the best option for me.
This still involves HWC to register the column metadata or update the partition information.
Something along these lines:
import org.apache.spark.sql.{DataFrame, SaveMode}
import com.hortonworks.hwc.HiveWarehouseSession
val df: DataFrame = ...
val externalPath = "/warehouse/tablespace/external/hive/my_db.db/my_table"
val hive = HiveWarehouseSession.session(spark).build()
// Write the ORC files directly to the external location
df.write.partitionBy("part_col").option("compression", "zlib").mode(SaveMode.Overwrite).orc(externalPath)
// Build the column list for the DDL, excluding the partition column
val columns = df.drop("part_col").schema.fields.map(field => s"${field.name} ${field.dataType.simpleString}").mkString(", ")
val ddl =
s"""
|CREATE EXTERNAL TABLE my_db.my_table ($columns)
|PARTITIONED BY (part_col string)
|STORED AS ORC
|Location '$externalPath'
""".stripMargin
hive.execute(ddl)
hive.execute("MSCK REPAIR TABLE my_db.my_table SYNC PARTITIONS")
Unfortunately, this throws a:
java.sql.SQLException: The query did not generate a result set!
from HWC
"How can I use spark to write to hive without using the warehouse connector but still writing to the same metastore which can later on be read by hive?"
We are working with the same setup (HDP 3.1 with Spark 2.3). Using the code below, we got the same error message as you did: "bucketId out of range: -1". The solution was to run set hive.fetch.task.conversion=none; in the Hive shell before querying the table.
The code to write data into Hive without the HWC:
import java.io.File
import org.apache.spark.sql.{SaveMode, SparkSession}
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
case class Record(key: Int, value: String)
val spark = SparkSession.builder()
.master("yarn")
.appName("SparkHiveExample")
.config("spark.sql.warehouse.dir", warehouseLocation)
.enableHiveSupport()
.getOrCreate()
spark.sql("USE databaseName")
val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
recordsDF.write.mode(SaveMode.Overwrite).format("orc").saveAsTable("sparkhive_records")
[Example from https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html]

How to read data from a csv file as a stream

I have the following table:
DEST_COUNTRY_NAME ORIGIN_COUNTRY_NAME count
United States Romania 15
United States Croatia 1
United States Ireland 344
Egypt United States 15
The table is represented as a Dataset.
scala> dataDS
res187: org.apache.spark.sql.Dataset[FlightData] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
I am able to sort the entries as a batch process.
scala> dataDS.sort(col("count")).show(100);
I now want to try if I can do the same using streaming. To do this, I suppose I will have to read the file as a stream.
scala> val staticSchema = dataDS.schema;
staticSchema: org.apache.spark.sql.types.StructType = StructType(StructField(DEST_COUNTRY_NAME,StringType,true), StructField(ORIGIN_COUNTRY_NAME,StringType,true), StructField(count,IntegerType,true))
scala> val dataStream = spark.
| readStream.
| schema(staticSchema).
| option("header","true").
| csv("data/flight-data/csv/2015-summary.csv");
dataStream: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
scala> dataStream.isStreaming;
res245: Boolean = true
But I am not able to progress further w.r.t. how to read the data as a stream.
I have executed the sort transformation:
scala> dataStream.sort(col("count"));
res246: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
I suppose now I should use Dataset's writeStream method. I ran the following two commands but both returned errors.
scala> dataStream.sort(col("count")).writeStream.
| format("memory").
| queryName("sorted_data").
| outputMode("complete").
| start();
org.apache.spark.sql.AnalysisException: Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;;
and this one
scala> dataStream.sort(col("count")).writeStream.
| format("memory").
| queryName("sorted_data").
| outputMode("append").
| start();
org.apache.spark.sql.AnalysisException: Sorting is not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode;;
From the errors, it seems I should be aggregating (group) data but I thought I don't need to do it as I can run any batch operation as a stream.
How can I understand how to sort data which arrives as a stream?
Unfortunately what the error messages tell you is accurate.
Sorting is supported only in complete mode (i.e. when each window returns complete dataset).
Complete mode requires aggregation (otherwise it would require unbounded memory - Why does Complete output mode require aggregation?)
The point you make:
but I thought I don't need to do it as I can run any batch operation as a stream.
is not without merit, but it misses a fundamental point, that Structured Streaming is not tightly bound to micro-batching.
One could easily come up with some unscalable hack
import org.apache.spark.sql.functions._
dataStream
.withColumn("time", window(current_timestamp, "5 minute")) // Some time window
.withWatermark("time", "0 seconds") // Immediate watermark
.groupBy("time")
.agg(sort_array(collect_list(struct($"count", $"DEST_COUNTRY_NAME", $"ORIGIN_COUNTRY_NAME"))).as("data"))
.withColumn("data", explode($"data"))
.select($"data.*")
.select(dataStream.columns.map(col): _*) // restore the original column order
.writeStream
.outputMode("append")
...
.start()
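The two constraints quoted in this answer (sorting needs Complete mode, Complete mode needs an aggregation) can be satisfied together by aggregating first and sorting the aggregate. The following is a rough sketch rather than a fix for the original query: it uses a MemoryStream as a stand-in for the CSV stream, and the query name sorted_totals and the per-destination sum are assumptions made for illustration.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.MemoryStream
import org.apache.spark.sql.functions.{col, sum}
import spark.implicits._

implicit val sqlContext: SQLContext = spark.sqlContext

// Stand-in for the CSV stream: (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count)
val flights = MemoryStream[(String, String, Int)]

val totals = flights.toDS
  .toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")
  .groupBy("DEST_COUNTRY_NAME")
  .agg(sum("count").as("total"))
  .sort(col("total").desc)           // allowed: sort on an aggregate in Complete mode

val sortedQuery = totals.writeStream
  .format("memory")
  .queryName("sorted_totals")        // hypothetical table name
  .outputMode("complete")
  .start()

flights.addData(
  ("United States", "Romania", 15),
  ("United States", "Croatia", 1),
  ("Egypt", "United States", 15))
sortedQuery.processAllAvailable()

val sortedTotals = spark.table("sorted_totals").collect()
sortedQuery.stop()
```

This sorts per-group totals, not the raw rows, so it answers a different question than the batch sort does. Sorting every raw row of an unbounded stream is what the engine refuses to do.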

Getting error saying "Queries with streaming sources must be executed with writeStream.start()" on spark structured streaming [duplicate]

This question already has answers here:
How to display a streaming DataFrame (as show fails with AnalysisException)?
(2 answers)
Closed 4 years ago.
I am running into issues while executing Spark SQL on top of Spark Structured Streaming.
Please find the error below.
Here is my code:
object sparkSqlIntegration {
def main(args: Array[String]) {
val spark = SparkSession
.builder
.appName("StructuredStreaming")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
.config("spark.sql.streaming.checkpointLocation", "file:///C:/checkpoint")
.getOrCreate()
setupLogging()
val userSchema = new StructType().add("name", "string").add("age", "integer")
// Create a stream of text files dumped into the logs directory
val rawData = spark.readStream.option("sep", ",").schema(userSchema).csv("file:///C:/Users/R/Documents/spark-poc-centri/csvFolder")
// Must import spark.implicits for conversion to DataSet to work!
import spark.implicits._
rawData.createOrReplaceTempView("updates")
val sqlResult= spark.sql("select * from updates")
println("sql results here")
sqlResult.show()
println("Otheres")
val query = rawData.writeStream.outputMode("append").format("console").start()
// Keep going until we're stopped.
query.awaitTermination()
spark.stop()
}
}
During execution, I am getting the following error. As I am new to streaming, can anyone tell me how I can execute Spark SQL queries on Spark Structured Streaming?
2018-12-27 16:02:40 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, LAPTOP-5IHPFLOD, 6829, None)
2018-12-27 16:02:41 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#6731787b{/metrics/json,null,AVAILABLE,#Spark}
sql results here
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
FileSource[file:///C:/Users/R/Documents/spark-poc-centri/csvFolder]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:392)
You don't need any of these lines
import spark.implicits._
rawData.createOrReplaceTempView("updates")
val sqlResult= spark.sql("select * from updates")
println("sql results here")
sqlResult.show()
println("Otheres")
Most importantly, select * isn't needed. When you print the dataframe, you would already see all the columns. Therefore, you also don't need to register the temp view to give it a name.
And when you format("console"), that eliminates the need for .show()
Refer to the Spark examples for reading from a network socket and output to console.
val words = // omitted ... some Streaming DataFrame
// Generating a running word count
val wordCounts = words.groupBy("value").count()
// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
.outputMode("complete")
.format("console")
.start()
query.awaitTermination()
Take away - use DataFrame operations like .select() and .groupBy() rather than raw SQL
Alternatively, you can use Spark Streaming (DStreams), as shown in those examples: call foreachRDD on the stream, convert each batch's RDD to a DataFrame, and then query it.
/** Case class for converting RDD to DataFrame */
case class Record(word: String)
val words = // omitted ... some DStream
// Convert RDDs of the words DStream to DataFrame and run SQL query
words.foreachRDD { (rdd: RDD[String], time: Time) =>
// Get the singleton instance of SparkSession
val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
import spark.implicits._
// Convert RDD[String] to RDD[case class] to DataFrame
val wordsDataFrame = rdd.map(w => Record(w)).toDF()
// Creates a temporary view using the DataFrame
wordsDataFrame.createOrReplaceTempView("words")
// Do word count on table using SQL and print it
val wordCountsDataFrame =
spark.sql("select word, count(*) as total from words group by word")
println(s"========= $time =========")
wordCountsDataFrame.show()
}
ssc.start()
ssc.awaitTermination()
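If SQL is still preferred in Structured Streaming, the temp view itself is not the problem. spark.sql over a view of a streaming DataFrame returns another streaming DataFrame, and it is only show() that triggers the exception. Below is a hedged sketch of that pattern. It swaps the CSV folder for a MemoryStream and the console sink for a memory sink so the result can be inspected; the query name adults and the age filter are invented for the example.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.MemoryStream
import spark.implicits._

implicit val sqlContext: SQLContext = spark.sqlContext

// Stand-in for the CSV source with the (name, age) schema from the question
val people = MemoryStream[(String, Int)]
people.toDS.toDF("name", "age").createOrReplaceTempView("updates")

// The SQL result is itself a streaming DataFrame
val adults = spark.sql("select name, age from updates where age >= 18")
val adultsIsStreaming = adults.isStreaming

// Start it with writeStream.start() instead of calling show()
val adultsQuery = adults.writeStream
  .format("memory")
  .queryName("adults")               // hypothetical table name
  .outputMode("append")
  .start()

people.addData(("ann", 31), ("bob", 12))
adultsQuery.processAllAvailable()

val adultRows = spark.table("adults").collect()
adultsQuery.stop()
```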

Spark Streaming aggregation and filter in the same window

I've a fairly easy task - events are coming in and I want to filter those with higher value than the average per group by key in the same window.
I think this is the relevant part of the code:
val avgfuel = events
.groupBy(window($"enqueuedTime", "30 seconds"), $"weatherCondition")
.agg(avg($"fuelEfficiencyPercentage") as "avg_fuel")
val joined = events.join(avgfuel, Seq("weatherCondition"))
.filter($"fuelEfficiencyPercentage" > $"avg_fuel")
val streamingQuery1 = joined.writeStream
.outputMode("append")
.trigger(Trigger.ProcessingTime("10 seconds"))
.option("checkpointLocation", checkpointLocation)
.format("json").option("path", containerOutputLocation).start()
events is a DStream.
The problem is that I'm getting empty files in the output location.
I'm using Databricks 3.5 - Spark 2.2.1 with Scala 2.11
What have I done wrong?
Thanks!
EDIT: a more complete code -
val inputStream = spark.readStream
.format("eventhubs") // working with azure event hubs
.options(eventhubParameters)
.load()
val schema = (new StructType)
.add("id", StringType)
.add("latitude", StringType)
.add("longitude", StringType)
.add("tirePressure", FloatType)
.add("fuelEfficiencyPercentage", FloatType)
.add("weatherCondition", StringType)
val df1 = inputStream.select($"body".cast("string").as("value")
, from_unixtime($"enqueuedTime").cast(TimestampType).as("enqueuedTime")
).withWatermark("enqueuedTime", "1 minutes")
val df2 = df1.select(from_json(($"value"), schema).as("body")
, $"enqueuedTime")
val df3 = df2.select(
$"enqueuedTime"
, $"body.id".cast("integer")
, $"body.latitude".cast("float")
, $"body.longitude".cast("float")
, $"body.tirePressure"
, $"body.fuelEfficiencyPercentage"
, $"body.weatherCondition"
)
val avgfuel = df3
.groupBy(window($"enqueuedTime", "10 seconds"), $"weatherCondition" )
.agg(avg($"fuelEfficiencyPercentage") as "fuel_avg", stddev($"fuelEfficiencyPercentage") as "fuel_stddev")
.select($"weatherCondition", $"fuel_avg")
val broadcasted = sc.broadcast(avgfuel)
val joined = df3.join(broadcasted.value, Seq("weatherCondition"))
.filter($"fuelEfficiencyPercentage" > $"fuel_avg")
val streamingQuery1 = joined.writeStream.
outputMode("append").
trigger(Trigger.ProcessingTime("10 seconds")).
option("checkpointLocation", checkpointLocation).
format("json").option("path", outputLocation).start()
This executes without errors, and after a while results start to be written. It might be due to the broadcast of the aggregation result, but I'm not sure.
Small investigation ;)
events can't be a DStream, because you are able to use Dataset operations on it; it must be a Dataset.
Stream-stream joins are not allowed in Spark 2.2. I've tried to run your code with events as a rate source and I get:
org.apache.spark.sql.AnalysisException: Inner join between two streaming DataFrames/Datasets is not supported;;
Join Inner, (value#1L = eventValue#41L)
The result is quite unexpected. You probably used read instead of readStream, so you created a static Dataset rather than a streaming one. Change it to readStream and it will work, after upgrading to 2.3, of course.
The code, apart from the comments above, is correct and should run on Spark 2.3. Note that you must also change the output mode to complete instead of append, because you are doing an aggregation.

Why does Apache Spark read unnecessary Parquet columns within nested structures?

My team is building an ETL process to load raw delimited text files into a Parquet based "data lake" using Spark. One of the promises of the Parquet column store is that a query will only read the necessary "column stripes".
But we're seeing unexpected columns being read for nested schema structures.
To demonstrate, here is a POC using Scala and the Spark 2.0.1 shell:
// Preliminary setup
sc.setLogLevel("INFO")
import org.apache.spark.sql.types._
import org.apache.spark.sql._
// Create a schema with nested complex structures
val schema = StructType(Seq(
StructField("F1", IntegerType),
StructField("F2", IntegerType),
StructField("Orig", StructType(Seq(
StructField("F1", StringType),
StructField("F2", StringType))))))
// Create some sample data
val data = spark.createDataFrame(
sc.parallelize(Seq(
Row(1, 2, Row("1", "2")),
Row(3, null, Row("3", "ABC")))),
schema)
// Save it
data.write.mode(SaveMode.Overwrite).parquet("data.parquet")
Then we read the file back into a DataFrame and project to a subset of columns:
// Read it back into another DataFrame
val df = spark.read.parquet("data.parquet")
// Select & show a subset of the columns
df.select($"F1", $"Orig.F1".as("Orig_F1")).show
When this runs we see the expected output:
+---+-------+
| F1|Orig_F1|
+---+-------+
| 1| 1|
| 3| 3|
+---+-------+
But... the query plan shows a slightly different story:
The "optimized plan" shows:
val projected = df.select($"F1", $"Orig.F1".as("Orig_F1"))
projected.queryExecution.optimizedPlan
// Project [F1#18, Orig#20.F1 AS Orig_F1#116]
// +- Relation[F1#18,F2#19,Orig#20] parquet
And "explain" shows:
projected.explain
// == Physical Plan ==
// *Project [F1#18, Orig#20.F1 AS Orig_F1#116]
// +- *Scan parquet [F1#18,Orig#20] Format: ParquetFormat, InputPaths: hdfs://sandbox.hortonworks.com:8020/user/stephenp/data.parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<F1:int,Orig:struct<F1:string,F2:string>>
And the INFO logs produced during execution also confirm that the Orig.F2 column is unexpectedly read:
16/10/21 15:13:15 INFO parquet.ParquetReadSupport: Going to read the following fields from the Parquet file:
Parquet form:
message spark_schema {
optional int32 F1;
optional group Orig {
optional binary F1 (UTF8);
optional binary F2 (UTF8);
}
}
Catalyst form:
StructType(StructField(F1,IntegerType,true), StructField(Orig,StructType(StructField(F1,StringType,true), StructField(F2,StringType,true)),true))
According to the Dremel paper and the Parquet documentation, columns for complex nested structures should be independently stored and independently retrievable.
Questions:
Is this behavior a limitation of the current Spark query engine? In other words, does Parquet support optimally executing this query, but Spark's query planner is naive?
Or, is this a limitation of the current Parquet implementation?
Or, am I not using the Spark APIs correctly?
Or, am I misunderstanding how Dremel/Parquet column storage is supposed to work?
Possibly related: Why does the query performance differ with nested columns in Spark SQL?
It's a limitation of the Spark query engine at the moment. Spark only handles predicate pushdown of simple types in Parquet, not nested StructTypes. The relevant JIRA ticket is below:
https://issues.apache.org/jira/browse/SPARK-17636
The issue has been fixed since Spark 2.4.0. This applies to struct as well as array of structs.
Before Spark 3.0.0:
Set spark.sql.optimizer.nestedSchemaPruning.enabled to true
See related Jira here: https://issues.apache.org/jira/browse/SPARK-4502
From Spark 3.0.0 onwards:
spark.sql.optimizer.nestedSchemaPruning.enabled is true by default
Related Jira here: https://issues.apache.org/jira/browse/SPARK-29805
Also related SO question: Efficient reading nested parquet column in Spark
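To see whether pruning kicks in on a given Spark version, one option is to enable the flag, rerun the projection from the question, and look at ReadSchema in the physical plan. The sketch below assumes Spark 2.4 or later and a spark-shell session. The output path is a made-up temporary location, and the exact plan text varies between versions, so no expected plan is shown.

```scala
import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.types._
import spark.implicits._

// Needed on 2.4.x; already the default from 3.0.0 onwards
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// Same nested schema as in the question
val nestedSchema = StructType(Seq(
  StructField("F1", IntegerType),
  StructField("F2", IntegerType),
  StructField("Orig", StructType(Seq(
    StructField("F1", StringType),
    StructField("F2", StringType))))))

val nestedPath = "/tmp/nested-pruning-demo.parquet"   // hypothetical path
spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(
    Row(1, 2, Row("1", "2")),
    Row(3, null, Row("3", "ABC")))),
  nestedSchema).write.mode(SaveMode.Overwrite).parquet(nestedPath)

val pruned = spark.read.parquet(nestedPath)
  .select($"F1", $"Orig.F1".as("Orig_F1"))

// With pruning active, ReadSchema should no longer mention Orig.F2
pruned.explain()
val prunedRows = pruned.collect()
```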
