Reading Parquet file with PySpark returns java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths

I am trying to load parquet files in the following directories:
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-1
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-2
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-3
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-4
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-5
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-6
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-7
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-8
This is what I wrote in PySpark:
s3_bucket_location_of_data = "s3://dir1/model=m1/version=newest/versionnumber=3/scores/"
df = spark.read.parquet(s3_bucket_location_of_data)
but I received the following error:
Py4JJavaError: An error occurred while calling o109.parquet.
: java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-1
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-2
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-3
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-4
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-5
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-6
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-7
s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-8
After reading other StackOverflow posts like this, I tried the following:
base_path="s3://dir1/" # I have tried to set this to "s3://dir1/model=m1/version=newest/versionnumber=3/scores/" as well, but it didn't work
s3_bucket_location_of_data = "s3://dir1/model=m1/version=newest/versionnumber=3/scores/"
df = spark.read.option("basePath", base_path).parquet(s3_bucket_location_of_data)
but that returned an error message similar to the one above. I am new to Spark/PySpark and I don't know what I could possibly be doing wrong here. Thank you in advance for your answers!

You don't need to specify the detailed path. Just load the files from the base path.
df = spark.read.parquet("s3://dir1")
df = df.filter("model = 'm1' and version = 'newest' and versionnumber = 3")
The directory structure is already partitioned by three columns: model, version and versionnumber. So read from the base path and filter on the partition columns; that way you read all the Parquet files under that partition path.
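To illustrate what "partitioned by three columns" means for these paths, here is a minimal sketch in plain Python of how key=value directory names map to column values. The function name partition_values is hypothetical, and this is only a simplified illustration, not Spark's actual partition-discovery code.

```python
def partition_values(path):
    """Collect key=value directory segments from a path, in order.

    Simplified illustration only; Spark's real partition discovery
    applies more rules than this (for example basePath handling).
    """
    # Drop the URI scheme ("s3://") if present, then split into segments.
    _, _, rest = path.partition("://")
    segments = (rest or path).strip("/").split("/")
    values = {}
    for segment in segments:
        key, sep, value = segment.partition("=")
        # Only segments of the form key=value count as partition columns.
        if sep and key and value:
            values[key] = value
    return values


path = "s3://dir1/model=m1/version=newest/versionnumber=3/scores/marketplace_id-1"
print(partition_values(path))
```

In this sketch, scores and marketplace_id-1 contribute nothing, because they are not in key=value form.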

Related

Databricks Error: AnalysisException: Incompatible format detected. with Delta

I'm getting the following error when I attempt to write to my data lake with Delta on Databricks
fulldf = spark.read.format("csv").option("header", True).option("inferSchema",True).load("/databricks-datasets/flights/")
fulldf.write.format("delta").mode("overwrite").save('/mnt/lake/BASE/flights/Full/')
The above produces the following error:
AnalysisException: Incompatible format detected.
You are trying to write to `/mnt/lake/BASE/flights/Full/` using Databricks Delta, but there is no
transaction log present. Check the upstream job to make sure that it is writing
using format("delta") and that you are trying to write to the table base path.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html
Any reason for the error?
This error usually occurs when the folder already contains data in another format, for example if you previously wrote Parquet or CSV files into it. Remove the folder completely and try again.
This worked in my similar situation:
%sql CONVERT TO DELTA parquet.`/mnt/lake/BASE/flights/Full/`
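The "no transaction log present" part of the message refers to the _delta_log directory that Delta keeps inside the table path. As a rough sketch, a check like the following could tell whether a folder already looks like a Delta table before writing to it. The helper name has_delta_log is hypothetical, and the sketch assumes the folder is reachable as an ordinary local filesystem path, which may not hold for a mounted data lake location.

```python
import os
import tempfile


def has_delta_log(table_path):
    """Return True if table_path contains a _delta_log directory."""
    return os.path.isdir(os.path.join(table_path, "_delta_log"))


# Demonstration on a throwaway local directory (a stand-in for the real table path).
with tempfile.TemporaryDirectory() as tmp:
    before = has_delta_log(tmp)                 # plain folder, no log yet
    os.mkdir(os.path.join(tmp, "_delta_log"))   # simulate a Delta table layout
    after = has_delta_log(tmp)
    print(before, after)
```

If the folder has data files but no _delta_log, that matches the situation described in the answers above: either clear the folder or convert it.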

How to Ignore Empty Parquet Files in Databricks When Creating Dataframe

I am having issues loading multiple files into a dataframe in Databricks. When I load a parquet file in an individual folder, it is fine, but I get the following error when I try to load multiple files into the dataframe:
DF = spark.read.parquet('S3 path/')
"org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually."
Per other StackOverflow answers, I added spark.sql.files.ignoreCorruptFiles true to the cluster configuration but it didn't seem to resolve the issue. Any other ideas?

Spark (PySpark) File Already Exists Exception

I am trying to save a data frame as a text file; however, I am getting a File Already Exists exception. I tried adding the mode to the code but to no avail. Furthermore, the file does not actually exist. Would anyone have an idea how I can solve this problem? I am using PySpark.
This is the code:
distFile = sc.textFile("/Users/jeremy/Downloads/sample2.nq")
mapper = distFile.map(lambda q: __q2v(q))
reducer = mapper.reduceByKey(lambda a, b: a + os.linesep + b)
data_frame = reducer.toDF(["context", "triples"])
data_frame.coalesce(1).write.partitionBy("context").text("/Users/jeremy/Desktop/so")
May I add that the exception is being raised after some time and that some data is actually stored in temporary files (which are obviously deleted).
Thanks!
Edit: Exception can be found here: https://gist.github.com/jerdeb/c30f65dc632fb997af289dac4d40c743
You can use overwrite or append mode to replace the existing output or to add data to it.
data_frame.coalesce(1).write.mode('overwrite').partitionBy("context").text("/Users/jeremy/Desktop/so")
or
data_frame.coalesce(1).write.mode('append').partitionBy("context").text("/Users/jeremy/Desktop/so")
I had the same problem and was able to get around it with this:
outputDir = "/FileStore/tables/my_result/"
dbutils.fs.rm(outputDir , True)
Just change the outputDir variable to whatever directory you are writing to.
You should check your executors and look at the logs of the ones that are failing.
In my case, I had a coalesce(1) on a large DF. 4 of my executors failed - 3 of them had the same error of org.apache.hadoop.fs.FileAlreadyExistsException: File already exists.
However, 1 of them had a different exception: org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 262144 bytes of memory, got 148328
I was able to fix it by increasing the executor memory so that the coalesce did not cause an out of memory error.

Unable to infer schema when loading Parquet file

response = "mi_or_chd_5"
outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite") # Success
print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
But then:
outcome2 = sqlc.read.parquet(response) # fail
fails with:
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
in
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
The documentation for parquet says the format is self describing, and the full schema was available when the parquet file was saved. What gives?
Using Spark 2.1.1. Also fails in 2.2.0.
Found this bug report, but it was fixed in 2.0.1, 2.1.0.
UPDATE: This works when connected with master="local", and fails when connected to master="mysparkcluster".
This error usually occurs when you try to read an empty directory as parquet.
Your outcome DataFrame is probably empty.
You could check if the DataFrame is empty with outcome.rdd.isEmpty() before writing it.
In my case, the error occurred because I was trying to read a parquet file whose name started with an underscore (e.g. _lots_of_data.parquet). Not sure why this was an issue, but removing the leading underscore solved the problem.
See also:
Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2
I'm using AWS Glue and I received this error while reading data from a data catalog table (location: s3 bucket).
After a bit of analysis I realised that this was because the file was not available at the file location (in my case, the S3 bucket path).
Glue was trying to apply the data catalog table schema to a file that doesn't exist.
After copying the file into the S3 bucket location, the issue was resolved.
Hope this helps someone who encounters this error in AWS Glue.
This case occurs when you try to read a table that is empty. If the table had correctly inserted data, there should be no problem.
Besides Parquet, the same thing happens with ORC.
I see there are already many answers, but the issue I faced was that my Spark job was trying to read a file that was being overwritten by another Spark job started earlier. It sounds bad, but I made that mistake.
Just to emphasize @Davos's answer in a comment: you will encounter this exact exception if your file name has a dot (.) or an underscore (_) at the start of the filename
val df = spark.read.format("csv").option("delimiter", "|").option("header", "false")
.load("/Users/myuser/_HEADER_0")
org.apache.spark.sql.AnalysisException:
Unable to infer schema for CSV. It must be specified manually.;
The solution is to rename the file and try again (e.g. rename _HEADER to HEADER).
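Based on the behaviour described in these answers, a quick pre-check could flag file names that are likely to be skipped. This is a minimal sketch in plain Python; the function name looks_hidden is hypothetical, and the prefix test is only an approximation of what the answers report, since the real rule may have additional cases.

```python
def looks_hidden(path):
    """Return True if the last path component starts with '_' or '.'.

    Approximation of the behaviour reported above, not Spark's exact rule.
    """
    # Take the final component of the path, ignoring any trailing slash.
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return name.startswith(("_", "."))


for candidate in ["/Users/myuser/_HEADER_0", "/Users/myuser/HEADER_0", ".hidden.parquet"]:
    print(candidate, looks_hidden(candidate))
```

Running this over a directory listing before calling the reader can show whether every data file would be treated as hidden, which would leave nothing to infer a schema from.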
Happened to me for a parquet file that was in the process of being written to.
Just need to wait for it to be completely written.
In my case, the error occurred because the filename contained underscores. Rewriting / reading the file without underscores (hyphens were OK) solved the problem...
For me this happened when I thought I was loading the correct file path but had instead pointed to an incorrect folder.
I ran into a similar problem with reading a csv
spark.read.csv("s3a://bucket/spark/csv_dir/.")
gave an error of:
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;
I found that if I remove the trailing . then it works, i.e.:
spark.read.csv("s3a://bucket/spark/csv_dir/")
I tested this for parquet adding a trailing . and you get an error of:
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
I just encountered the same problem, but none of the solutions here worked for me. I tried to merge the row groups of my Parquet files on HDFS by first reading them and then writing them to another place using:
df = spark.read.parquet('somewhere')
df.write.parquet('somewhere else')
But later when I query it with
spark.sql('SELECT sth FROM parquet.`hdfs://host:port/parquetfolder/` WHERE .. ')
It shows the same problem.
I finally solved this by using pyarrow:
df = spark.read.parquet('somewhere')
pdf = df.toPandas()
adf = pa.Table.from_pandas(pdf) # import pyarrow as pa
fs = pa.hdfs.connect()
fw = fs.open(path, 'wb')
pq.write_table(adf, fw) # import pyarrow.parquet as pq
fw.close()
Check whether .parquet files are available at the response path. I assume that either the files don't exist or they exist in some internal (partitioned) folders. If the files are under multiple levels of folders, append /* for each level.
In my case the .parquet files were three folders below base_path, so I gave the path as base_path/*/*/*
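The "append /* for each folder" advice can be sketched as a small helper in plain Python. The name glob_for_depth is hypothetical, and the depth has to be known in advance; this only builds the path string and does not touch the filesystem.

```python
def glob_for_depth(base_path, depth):
    """Append one '/*' per folder level between base_path and the files."""
    return base_path.rstrip("/") + "/*" * depth


# Files sitting three folders below base_path, as in the answer above.
print(glob_for_depth("base_path", 3))
```

The resulting string is what would be passed to the reader in place of the bare base path.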
As others mentioned, in my case this error appeared when I was reading S3 keys that did not exist.
A solution is to keep only the keys that do exist:
import com.amazonaws.services.s3.AmazonS3URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import java.net.URI
def addEndpointToUrl(url: String, domain: String = "s3.amazonaws.com"): String = {
  val uri = new URI(url)
  val hostWithEndpoint = uri.getHost + "." + domain
  new URI(uri.getScheme, uri.getUserInfo, hostWithEndpoint, uri.getPort, uri.getPath, uri.getQuery, uri.getFragment).toString
}
def createS3URI(url: String): AmazonS3URI = {
  try {
    // try to instantiate AmazonS3URI with url
    new AmazonS3URI(url)
  } catch {
    case e: IllegalArgumentException if e.getMessage.
      startsWith("Invalid S3 URI: hostname does not appear to be a valid S3 endpoint") => {
      new AmazonS3URI(addEndpointToUrl(url))
    }
  }
}
def s3FileExists(spark: SparkSession, url: String): Boolean = {
  val amazonS3Uri: AmazonS3URI = createS3URI(url)
  val s3BucketUri = new URI(s"${amazonS3Uri.getURI().getScheme}://${amazonS3Uri.getBucket}")
  FileSystem
    .get(s3BucketUri, spark.sparkContext.hadoopConfiguration)
    .exists(new Path(url))
}
and you can use it as:
val partitions = List(yesterday, today, tomorrow)
.map(f => somepath + "/date=" + f)
.filter(f => s3FileExists(spark, f))
val df = spark.read.parquet(partitions: _*)
For that solution I took some code out of spark-redshift project, here.
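For PySpark users, the same filter-then-read idea can be sketched in plain Python. The helper name keep_existing and the example dates are hypothetical, and the existence check is passed in as a function because how you test for an S3 key depends on your environment; here a set of known paths stands in for it.

```python
from typing import Callable, Iterable, List


def keep_existing(paths: Iterable[str], exists: Callable[[str], bool]) -> List[str]:
    """Return only the paths for which the supplied existence check is true."""
    return [p for p in paths if exists(p)]


# Stand-in for a real existence check against the storage system.
known = {"somepath/date=2020-01-01", "somepath/date=2020-01-02"}
candidates = ["somepath/date=" + d for d in ("2020-01-01", "2020-01-02", "2020-01-03")]

partitions = keep_existing(candidates, known.__contains__)
print(partitions)
```

The filtered list can then be unpacked into the reader, for example spark.read.parquet(*partitions), provided it is not empty.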
You are just loading a Parquet file. Of course the Parquet file had a valid schema; otherwise it would not have been saved as Parquet. This error means either:
The Parquet file does not exist. (In 99.99% of cases this is the issue. Spark error messages are often less than obvious.)
Or the Parquet file somehow got corrupted, or it is not a Parquet file at all.
I ran into this issue because of a folder-in-folder problem.
For example, folderA.parquet was supposed to contain the partitions, but instead it contained folderB.parquet, which in turn contained the partitions.
Resolution: move the files to the parent folder and delete the subfolder.
You can read with /*:
outcome2 = sqlc.read.parquet(f"{response}/*") # works for me
My problem was moving the files to another location without changing the paths in the files in the _spark_metadata folder.
To fix it do something like:
sed -i "s|/old/path|/new/path|g" ** .*
It seems this issue can be caused by many things, so here is another scenario:
By default the Spark Parquet source uses partition inference, which means it expects the file paths to be partitioned as key=value pairs, with the load happening at the root.
To avoid this, if we are sure all the leaf files have an identical schema, we can use
df = spark.read.format("parquet")\
.option("recursiveFileLookup", "true")
to disable partition inference manually. It basically loads the files one by one using Parquet's logical_type.

org.apache.spark.sql.AnalysisException: Path does not exist

I was having issues trying to read a parquet file stored as a resource in my fat-jar, so I tried the following code, which reads the resource file and copies it onto disk:
val inputFile = "test.parquet"
val parquetFile = "/part-r-00000-2185f9a7-ea70-41be-95d2-e9f70f93c43b.parquet"
FileUtils.copyInputStreamToFile(Main2.getClass.getResourceAsStream(parquetFile), new File(inputFile))
LOGGER.info("saved resource to external file")
This code runs successfully. But when I try to read the file using:
spark.sqlContext.read.parquet(inputFile)
I get this error:
ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://nameservice1/user/me/test.parquet
How can I fix this? I just want to be able to read a parquet file stored as resource in fat-jar. I have tried many things but none of them work.
FileUtils.copyInputStreamToFile copies the input stream of the file in your fat jar to the local file system, not to the distributed file system (i.e. HDFS).
Try the code below; it should work:
spark.sqlContext.read.parquet("file:////< absolute path of inputFile >")
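The question's code is Scala, but the point about needing an absolute file: URI can be illustrated in plain Python. This is a minimal sketch using only the standard library; the helper name local_file_uri is hypothetical, and test.parquet is the relative file name from the question.

```python
from pathlib import Path


def local_file_uri(path):
    """Turn a possibly relative local path into an absolute file: URI."""
    # resolve() makes the path absolute; as_uri() requires an absolute path.
    return Path(path).resolve().as_uri()


uri = local_file_uri("test.parquet")
print(uri)
```

A bare relative name like test.parquet carries no scheme, which is why it was resolved against HDFS in the error message; the explicit file: URI removes that ambiguity.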
