How to Ignore Empty Parquet Files in Databricks When Creating Dataframe - apache-spark

I am having issues loading multiple files into a dataframe in Databricks. Loading a parquet file from an individual folder works fine, but the following error is returned when I try to load multiple files into the dataframe:
DF = spark.read.parquet('S3 path/')
"org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually."
Per other StackOverflow answers, I added spark.sql.files.ignoreCorruptFiles true to the cluster configuration but it didn't seem to resolve the issue. Any other ideas?
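For reference, a minimal sketch of the same setup done at session level rather than in the cluster configuration, plus a glob filter so only files ending in .parquet are considered; the S3 path is a placeholder, and this is only an illustration of the configuration the question describes, not a confirmed fix:

# Sketch only: session-level equivalent of the cluster config from the question.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

# Restrict the read to *.parquet files so stray empty or non-parquet files
# under the prefix are not picked up. Path is a placeholder.
DF = (spark.read
      .option("pathGlobFilter", "*.parquet")
      .parquet("s3://my-bucket/path/"))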

Related

How to properly load 10 million JSON files into a dataframe using PySpark

I have 10 million JSON files in S3 and I want to load all of them into a PySpark DataFrame in Databricks.
Below is the code that I have used:
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField("Col1",StringType(),True), ... , ..., ...])
df = spark.read.schema(schema).json("s3://data_bucket/json_files/")
After counting the records, I found that only 3,858,630 files were loaded into the dataframe; 6,141,370 files were not loaded.
I don't know why not all of the files are being loaded.
I appreciate any help!
I would start by checking whether those 6,141,370 files match the schema you specified exactly. If they don't, the non-matching records could be nulled out (see the sketch below).
I would also chunk them into batches of a few thousand files and see if that works. If it does, you can extend it to more files and merge the dataframes together.
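A rough way to check the first suggestion, assuming the rest of the schema from the question is kept in place of the ellipsis: add a corrupt-record column so rows that fail the declared schema are retained and can be counted instead of being silently nulled out. This is only a sketch; the column name _corrupt_record is just the conventional default.

from pyspark.sql.types import StructType, StructField, StringType

# Keep the original fields (Col1, ...) and add a catch-all column for
# records that do not match the declared schema.
schema = StructType([
    StructField("Col1", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

df = (spark.read
      .schema(schema)
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("s3://data_bucket/json_files/"))

df.cache()  # cache before filtering on the corrupt-record column alone
bad_count = df.filter(df["_corrupt_record"].isNotNull()).count()
print(bad_count)  # number of records that did not match the schema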

Getting duplicate values for each key while querying parquet file using PySpark

We are getting duplicate values when querying data from a parquet file using PySpark, while the same query returns correct data through Presto.
Spark Version: 3.1
Configuration setup so far:
from scbuilder.kubernetes import Kubernetes
kobj = Kubernetes(kubernetes = True)
kobj.setExecutorCores(5)
kobj.setExecutorMemory("5g")
kobj.addAdditionalConf("spark.driver.memory", "8g")
kobj.setNumberOfExecutor(2)
sc = kobj.buildSparkSession()
sc.getActiveSession()
sc.conf.set('sc.hadoopConfiguration.setClass',"mapreduce.input.pathFilter.class")
sc.conf.set("hive.convertMetastoreParquet",False)
sc.conf.set("hive.input.format","org.apache.hadoop.hive.ql.io.HiveInputFormat")
Actual data count: 17,722
Count after querying the parquet file: 1,036,320
I need help understanding why the parquet file shows this behavior and how we can fix it.
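No answer is recorded for this one, but a quick diagnostic (not a fix) would be to quantify the duplication per key before changing any configuration. A minimal sketch, where key_col and the table path are placeholders and sc is the SparkSession built in the question:

# Sketch only: count how many copies of each key exist and compare the raw
# count with the de-duplicated count. "key_col" and the path are placeholders.
df = sc.read.parquet("/path/to/parquet_table")
df.groupBy("key_col").count().filter("count > 1").show()
print(df.count(), df.dropDuplicates().count())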

How to resolve invalid column name on parquet file read itself in PySpark

I set up a standalone Spark cluster and a standalone HDFS.
I installed PySpark and was able to create a Spark session.
I uploaded one parquet file to HDFS under /data: hdfs://localhost:9000/data
I tried to create a dataframe out of this directory using PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName("test").getOrCreate()
df = spark.read.parquet("hdfs://localhost:9000/data").withColumnRenamed("Wafer ID", "Wafer_ID")
I am getting an invalid column name error even with withColumnRenamed.
I tried the following code as well, but got the same error:
from pyspark.sql.functions import col
df = spark.read.parquet("hdfs://localhost:9000/data").select(col("Wafer ID").alias("Wafer_ID"))
I have ways to change the column names manually (e.g. with pandas) or to use a different file entirely, but I want to know if there is a way to solve this problem directly.
What am I doing wrong?

Error while reading a folder in Azure Databricks which has subfolders with parquet files

I am reading a folder in ADLS in Azure Databricks which has subfolders containing parquet files.
path - base_folder/filename/
filename has subfolders like 2020 and 2021, and these folders in turn have subfolders for month and day.
So the path to an actual parquet file looks like base_folder/filename/2020/12/01/part11111.parquet.
I am getting the below error if I give the base folder path.
I have tried the commands in the thread below as well, but it shows the same error.
Unable to infer schema for Parquet. It must be specified manually
Please help me read all parquet files in all subfolders into one dataframe.
Your first error, Unable to infer schema for Parquet, usually happens when you try to read an empty directory as parquet. You can specify * in your path and it will go through the subdirectories; take a look here: Reading parquet files from multiple directories in Pyspark.
Second error: you are using the Scala API, while the example you've provided is in Python, and the DataFrameReader API differs between the two. Ref: Scala - DataFrameReader - Python - DataFrameReader
Try with:
spark.read.format("parquet").load(landingFolder)
as specified here:
Generic Load/Save Functions
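A PySpark sketch of the two suggestions above, assuming the base_folder/filename layout from the question; the ADLS container and storage account names are placeholders. recursiveFileLookup (Spark 3.0+) walks every year/month/day subfolder, while the second variant globs the partition levels explicitly:

# Sketch only: walk all nested subfolders under the base path.
df = (spark.read
      .option("recursiveFileLookup", "true")
      .format("parquet")
      .load("abfss://container@storageaccount.dfs.core.windows.net/base_folder/filename/"))

# Alternative: glob the year/month/day levels explicitly.
df = spark.read.parquet("abfss://container@storageaccount.dfs.core.windows.net/base_folder/filename/*/*/*/")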

Spark in docker parquet error No predefined schema found

I have a local Spark test cluster (including R) based on https://github.com/gettyimages/docker-spark. In particular, this image is used: https://hub.docker.com/r/possibly/spark/
When trying to read a parquet file with SparkR, the exception below occurs. Reading the parquet file works without any problems on a local Spark installation.
myData.parquet <- read.parquet(sqlContext, "/mappedFolder/myFile.parquet")
16/03/29 20:36:02 ERROR RBackendHandler: parquet on 4 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under file:/mappedFolder/myFile.parquet.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$MetadataCache$$readSchema(ParquetRelation.scala:512)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCac
Strangely, the error is the same even for files that do not exist.
However in the terminal I can see that the files are there:
/mappedFolder/myFile.parquet
root@worker:/mappedFolder/myFile.parquet# ls
_common_metadata part-r-00097-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet part-r-00196-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet
....
My initial parquet file seems to have been corrupted during my test runs of the dockerized Spark.
To solve: re-create the parquet files from the original sources.
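A minimal sketch of that fix in PySpark (rather than SparkR, for consistency with the rest of the page), assuming the original data is still available as a CSV at a hypothetical path and a SparkSession named spark already exists:

# Sketch only: rebuild the parquet dataset from the original source and
# overwrite the corrupted copy. The CSV path is a placeholder.
src = spark.read.csv("file:/originalSources/myData.csv", header=True, inferSchema=True)
src.write.mode("overwrite").parquet("file:/mappedFolder/myFile.parquet")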
