Spark in docker parquet error No predefined schema found - apache-spark

I have a local Spark test cluster (including R) based on https://github.com/gettyimages/docker-spark. In particular, this image is used: https://hub.docker.com/r/possibly/spark/
When I try to read a parquet file with SparkR, the following exception occurs. Reading the same parquet file works without any problems on a local Spark installation.
myData.parquet <- read.parquet(sqlContext, "/mappedFolder/myFile.parquet")
16/03/29 20:36:02 ERROR RBackendHandler: parquet on 4 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under file:/mappedFolder/myFile.parquet.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$MetadataCache$$readSchema(ParquetRelation.scala:512)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCac
Strangely, the error is the same even for files that do not exist.
However, in the terminal I can see that the files are there:
/mappedFolder/myFile.parquet
root@worker:/mappedFolder/myFile.parquet# ls
_common_metadata part-r-00097-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet part-r-00196-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet
....

My initial parquet file seems to have been corrupted during my test runs of the dockerized Spark.
To solve it: re-create the parquet files from the original sources.
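As a rough sketch of that fix (using PySpark rather than SparkR, and assuming the original data is still available as a CSV; the paths /originalData.csv and /mappedFolder/myFile.parquet are placeholders, not from the question), re-writing the parquet directory and reading it back confirms the footer and schema are intact:

# Minimal sketch: re-create the parquet file from the original source and verify it.
# Paths and the CSV source format are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("recreate-parquet").getOrCreate()

# Re-read the original data and write a fresh parquet directory.
original = spark.read.csv("/originalData.csv", header=True, inferSchema=True)
original.write.mode("overwrite").parquet("/mappedFolder/myFile.parquet")

# Reading it back should now print the schema instead of
# "No predefined schema found".
spark.read.parquet("/mappedFolder/myFile.parquet").printSchema()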

Related

Spark: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.orc.storage.serde2.io.DateWritable

I received this error (java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.orc.storage.serde2.io.DateWritable) while executing a PySpark file that reads data from ORC files in a partitioned folder.
The data in this input folder should be read, transformed, and written to a folder that has an existing external table built on top of it (MSCK REPAIR will be run after writing data to this target folder).
Code sample (process):
Step 1:
Df = spark.read.orc("input_path")
Step 2:
--> apply transformations (no cast function used)
Step 3:
Transformed_Df.write \
    .partitionBy("columns") \
    .mode("overwrite") \
    .orc("output_path")
When I checked the logs, I saw this error occurring multiple times right after the partitions are read, so I believe it happens before the transformations are applied and the data is written to the target.
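As a hedged debugging sketch (the column name date_col, and the idea that a column's type differs between the ORC files and the table schema, are assumptions rather than facts from the question), comparing the inferred ORC schema with the table definition and casting the suspect column explicitly can help locate the Text/DateWritable mismatch:

# Sketch: inspect the inferred schema and cast a suspect column explicitly.
# "input_path" and "date_col" are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("orc-schema-check").getOrCreate()

df = spark.read.orc("input_path")
df.printSchema()  # compare these types with the external table definition

# If a date column arrives as a string, cast it explicitly before writing.
fixed = df.withColumn("date_col", col("date_col").cast("date"))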

How to Ignore Empty Parquet Files in Databricks When Creating Dataframe

I am having issues loading multiple files into a dataframe in Databricks. Loading a parquet file from an individual folder works fine, but the following error is returned when I try to load multiple files into the dataframe:
DF = spark.read.parquet('S3 path/')
"org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually."
Per other StackOverflow answers, I added spark.sql.files.ignoreCorruptFiles true to the cluster configuration but it didn't seem to resolve the issue. Any other ideas?
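As a hedged sketch (the S3 path and the schema fields below are placeholders, and applying the setting at the session level rather than the cluster level is an assumption about what was tried), there are two common ways around "Unable to infer schema for Parquet" when a directory mixes valid and zero-byte files:

# 1) Set the flag on the session (same effect as the cluster-level config).
# 2) Supply the schema explicitly so Spark does not need a readable footer
#    to infer it.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

schema = StructType([
    StructField("id", LongType(), True),      # placeholder columns
    StructField("value", StringType(), True),
])
df = spark.read.schema(schema).parquet("s3://bucket/path/")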

disabling _spark_metadata in Structured streaming in spark 2.3.0

My Structured Streaming application writes to parquet and I want to get rid of the _spark_metadata folder it creates. I used the property below and it seemed fine:
--conf "spark.hadoop.parquet.enable.summary-metadata=false"
When the application starts, no _spark_metadata folder is generated. But once it moves to RUNNING status and starts processing messages, it fails with the error below saying the _spark_metadata folder doesn't exist. It seems structured streaming relies on this folder and can't run without it. Does disabling the summary-metadata property make any sense in this context, or is it a bug that the stream does not honor the conf?
Caused by: java.io.FileNotFoundException: File /_spark_metadata does not exist.
at org.apache.hadoop.fs.Hdfs.listStatus(Hdfs.java:261)
at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1765)
at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1761)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1761)
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1726)
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1685)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.list(HDFSMetadataLog.scala:370)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.getLatest(HDFSMetadataLog.scala:231)
at org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:99)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:477)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:475)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
The reason this was happening is that the Kafka checkpoint folder was not cleaned up. The files inside the Kafka checkpoint were cross-referencing the Spark metadata files and failing. Once I removed both, it started working.
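As a hedged sketch of why the two folders go together (paths, topic, and broker address are placeholders, and the Kafka source assumes the spark-sql-kafka package is on the classpath), the checkpointLocation passed to the file sink is what cross-references the sink's _spark_metadata log, so both directories have to be removed together before restarting the query:

# Sketch: the checkpoint directory and the sink's _spark_metadata log belong
# together; remove both or neither when resetting the query.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "topic")
          .load())

query = (stream.writeStream
         .format("parquet")
         .option("path", "/output/events")                     # contains _spark_metadata
         .option("checkpointLocation", "/checkpoints/events")  # references that log
         .start())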

Spark empty _metadata file in parquet output

I am using an Oozie workflow to generate a parquet file. Occasionally, when I try to read the file using Spark, I get the following exception:
java.io.IOException: Could not read footer:
java.lang.RuntimeException:
hdfs://ip-10-1-2-243.ec2.internal:8020/path/to/file/_metadata is not a
Parquet file (too small)
After deleting the metadata file, I can read in the rest of the files normally. I would like to know what causes Spark to output an empty _metadata file, and how I can avoid it in the future.
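One possible way to avoid the summary files altogether, sketched under the assumption that nothing downstream needs them (the same parquet.enable.summary-metadata property mentioned in the structured-streaming question above), is to disable Parquet summary metadata when the session is created; the spark.hadoop. prefix forwards the setting into the Hadoop configuration. The output path is a placeholder:

# Sketch: turn off the _metadata/_common_metadata summary files so a partially
# written job cannot leave an empty one behind.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.parquet.enable.summary-metadata", "false")
         .getOrCreate())

df = spark.range(10)  # stand-in for the real job's output
df.write.mode("overwrite").parquet("hdfs:///path/to/file")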

Spark Sql 1.3.0 + parquet

USING SPARK-SQL:
I've created a table without parquet in HDFS and everything is ok.
I've created the same table structure but with "stored as parquet"; I've also created the parquet files, uploaded them to HDFS, and run "load data inpath 'hdfs://server/parquet_files'".
But when I try to execute "select * from table_name",
I get this exception:
Exception in thread "main" java.sql.SQLException: java.lang.IllegalArgumentException: Wrong FS: hdfs://server:8020/user/hive/warehouse/table_name, expected: file:///
Any tip?
Fixed by including the Hadoop configuration files (core-site.xml and hdfs-site.xml) in the Spark configuration.
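A hedged sketch of the same idea (written with the modern SparkSession API rather than Spark 1.3.0's HiveContext; the hdfs://server:8020 address is a placeholder): Spark falls back to file:/// when it cannot find the cluster's fs.defaultFS, so either point it at the Hadoop config directory via HADOOP_CONF_DIR or set the default filesystem explicitly when building the session:

# Sketch: make Spark aware of the HDFS namenode so table paths resolve to
# hdfs:// instead of file:///.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.fs.defaultFS", "hdfs://server:8020")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("select * from table_name").show()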
