How to read nested, partitioned data from S3 into Spark?

I have data in S3 that follows a partition scheme like this:
s3://my_bucket/path1/p1=1642/p2=431/p3=2023-02-02 00:00:00/*.parquet
I want to read it into a Spark DataFrame by prefix, something like:
df = spark.read.option("recursiveFileLookup","true").parquet('s3://my_bucket/path1/p1=1642/p2=431/*')
When I try that, I get an error that "path doesn't exist"
Even if I put the full path to a .parquet file, I still get an error related to the p3 partition:
pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: fdatehb=2023-02-02 00:00:00
I'm guessing that's because it has a whitespace character(?)
Can someone tell me how to read this data in spark?

Related

Unable to read Databricks Delta / Parquet File with Delta Format

I am trying to read a Delta/Parquet file in Databricks using the following code:
df3 = spark.read.format("delta").load('/mnt/lake/CUR/CURATED/origination/company/opportunities_final/curorigination.presentation.parquet')
However, I'm getting the following error:
A partition path fragment should be the form like `part1=foo/part2=bar`. The partition path: curorigination.presentation.parquet
This seemed very straightforward, so I'm not sure why I'm getting the error.
Any thoughts?
The file structure looks like the following
The error shows that Delta Lake thinks your partition path naming is wrong.
If your Delta table has partition columns, for example year, month, and day, the path should look like /mnt/lake/CUR/CURATED/origination/company/opportunities_final/year=yyyy/month=mm/day=dd/curorigination.presentation.parquet, and you only need to load the table directory: df = spark.read.format("delta").load("/mnt/lake/CUR/CURATED/origination/company/opportunities_final").
If you just want to read it as Parquet, you can do df = spark.read.parquet("/mnt/lake/CUR/CURATED/origination/company/opportunities_final"), because you don't need to give the absolute path of the Parquet file.
The above error mainly happens because of the incorrect path format curorigination.presentation.parquet. Please check your Delta location, and also check whether the Delta files were actually created:
%fs ls /mnt/lake/CUR/CURATED/origination/company/opportunities_final/
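The same check can be sketched in Python. This assumes the mount is also visible on the driver's local filesystem (on Databricks that is typically under /dbfs, which is an assumption about the cluster). A Delta table is a directory that contains a _delta_log subfolder, so a path that ends in a single .parquet file is not something format("delta") can load. The helper name looks_like_delta_table is made up for illustration.

```python
import os


def looks_like_delta_table(local_dir: str) -> bool:
    """Return True if local_dir is a directory that contains a _delta_log folder."""
    return os.path.isdir(os.path.join(local_dir, "_delta_log"))
```

For example, looks_like_delta_table("/dbfs/mnt/lake/CUR/CURATED/origination/company/opportunities_final") should return True only if that folder was written as a Delta table.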
I reproduced the same thing in my environment. First of all, I created a data frame with a parquet file.
df1 = spark.read.format("parquet").load("/FileStore/tables/")
display(df1)
After that I converted the Parquet file to Delta format and saved it to /mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1.
df1.coalesce(1).write.format('delta').mode("overwrite").save("/mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1")
# Read the Delta table back from the location it was saved to
df3 = spark.read.format("delta").load("/mnt/lake/CUR/CURATED/origination/company/opportunities_final/demo_delta1")
display(df3)

PySpark/DataBricks: How to read parquet files using 'file:///' and not 'dbfs'

I am trying to use petastorm in a way that requires me to tell it where my Parquet files are stored, using one of the following:
hdfs://some_hdfs_cluster/user/yevgeni/parquet8, or file:///tmp/mydataset, or s3://bucket/mydataset, or gs://bucket/mydataset. Since I am on Databricks and given other constraints, my only option is file:///.
However, I am at a loss as to how to specify the location of my Parquet files. I keep getting an error saying that the path does not exist.
Here is what I am doing:
# save spark df to parquet
dbutils.fs.rm('dbfs:/mnt/team01/assembled_train.parquet', recurse=True)
assembled_train.write.parquet('dbfs:/mnt/team01/assembled_train')
# look at files
display(dbutils.fs.ls('mnt/team01/assembled_train/'))
# results
path name size
dbfs:/mnt/team01/assembled_train/_SUCCESS _SUCCESS 0
dbfs:/mnt/team01/assembled_train/_committed_2150262571233317067 _committed_2150262571233317067 856
dbfs:/mnt/team01/assembled_train/_started_2150262571233317067 _started_2150262571233317067 0
dbfs:/mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet 578991
dbfs:/mnt/team01/assembled_train/part-00001-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035358-1-c000.snappy.parquet part-00001-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035358-1-c000.snappy.parquet 579640
dbfs:/mnt/team01/assembled_train/part-00002-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035359-1-c000.snappy.parquet part-00002-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035359-1-c000.snappy.parquet 580675
dbfs:/mnt/team01/assembled_train/part-00003-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035360-1-c000.snappy.parquet part-00003-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035360-1-c000.snappy.parquet 579483
dbfs:/mnt/team01/assembled_train/part-00004-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035361-1-c000.snappy.parquet part-00004-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035361-1-c000.snappy.parquet 578807
dbfs:/mnt/team01/assembled_train/part-00005-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035362-1-c000.snappy.parquet part-00005-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035362-1-c000.snappy.parquet 580942
dbfs:/mnt/team01/assembled_train/part-00006-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035363-1-c000.snappy.parquet part-00006-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035363-1-c000.snappy.parquet 579202
dbfs:/mnt/team01/assembled_train/part-00007-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035364-1-c000.snappy.parquet part-00007-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035364-1-c000.snappy.parquet 579810
While testing with a basic dataframe load from the file structure, like so:
df1 = spark.read.option("header", "true").parquet('file:///mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet')
I get a "file does not exist" error.
You just need to specify the path as it is, no need for 'file:///':
df1 = spark.read.option("header", "true").parquet('/mnt/team01/assembled_train/part-00000-tid-2150262571233317067-79e6b077-3770-47a9-9fec-155a412768f1-1035357-1-c000.snappy.parquet')
If this doesn't work, try the methods in https://docs.databricks.com/applications/machine-learning/load-data/petastorm.html#configure-cache-directory
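For a library that insists on a file:/// URL, one possible approach is to go through the local mount of DBFS. On Databricks, DBFS is usually exposed on the cluster nodes under /dbfs, so dbfs:/mnt/... would correspond to file:///dbfs/mnt/...; whether that mount is available depends on the cluster, so treat it as an assumption. The helper name dbfs_to_local_uri is hypothetical.

```python
def dbfs_to_local_uri(path: str) -> str:
    """Map 'dbfs:/mnt/x' or '/mnt/x' to 'file:///dbfs/mnt/x'.

    Assumes DBFS is mounted on the local filesystem at /dbfs.
    """
    if path.startswith("dbfs:"):
        path = path[len("dbfs:"):]
    return "file:///dbfs/" + path.lstrip("/")
```

With the directory from the question, dbfs_to_local_uri('dbfs:/mnt/team01/assembled_train') gives 'file:///dbfs/mnt/team01/assembled_train'.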

Spark overwrite parquet files on aws s3 raise URISyntaxException: Relative path in absolute URI

I am using Spark to write and read Parquet files on AWS S3. I have Parquet files stored in
's3a://mybucket/file_name.parquet/company_name=company_name/record_day=2019-01-01 00:00:00'
partitioned by 'company_name' and 'record_day'.
I want to write a basic pipeline that updates my Parquet files on a regular basis by 'record_day'. To do this, I am going to use overwrite mode:
df.write.mode('overwrite').parquet('s3a://mybucket/file_name.parquet/company_name=company_name/record_day=2019-01-01 00:00:00')
But I am getting an unexpected error: 'java.net.URISyntaxException: Relative path in absolute URI: key=2019-01-01 00:00:00'.
I spent several hours searching for the problem but found no solution. As a test, I replaced the 'overwrite' mode with 'append', and everything works fine. I also made a simple DataFrame, and overwrite mode works fine on it too. I know that I can solve my problem in a different way, by deleting and then rewriting the particular partition, but I would like to understand what causes the error.
Spark 2.4.4 Hadoop 2.8.5
Appreciate any help.
I had the same error, and my solution was to remove the : part from the date.
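A minimal sketch of that idea: format the timestamp so the partition value has no colons (and no spaces) before it ends up in a path. The function name partition_safe and the chosen format string are assumptions for illustration, not something from the original answer.

```python
from datetime import datetime


def partition_safe(ts: datetime) -> str:
    """Format a timestamp without spaces or colons, for use as a partition value."""
    return ts.strftime("%Y-%m-%dT%H-%M-%S")
```

In a Spark job, the same idea could be applied to the partition column with pyspark.sql.functions.date_format before calling partitionBy, for example with a pattern such as "yyyy-MM-dd" if day granularity is enough.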

Dataframe not able to write on S3

I am creating a DataFrame from an existing Hive table. The table is partitioned on the date and site columns. When I overwrite the data in this same table after some computation with the previous day's data, it loads successfully.
But when I try to write the final DataFrame to an S3 bucket, I get an error saying the file was not found. The file it mentions is the previous day's file, which has now been overwritten.
If I write the DataFrame first and then overwrite the table, it runs fine.
What does writing to an S3 location have to do with the table's partition files?
Below are the error and the code.
java.io.FileNotFoundException: No such file or directory: s3://bucket_1/DM/web_fact_tbl/local_dt=2018-05-10/site_name=ABC/part-00000-882a6e29-eb6a-477c-8b88-6fe853956674.c000
fact_tbl = spark.table('db.web_fact_tbl')
fact_lkp = fact_tbl.filter(fact_tbl['local_dt']=='2018-05-10')
fact_join = fact_lkp.alias('a').join(fact_tbl.alias('b'),(col('a.id') == col('b.id')),"inner").select('a.*')
fact_final = fact_join.union(fact_tbl)
fact_final.coalesce(2).createOrReplaceTempView('cwf')
spark.sql('INSERT OVERWRITE TABLE dm.web_fact_tbl PARTITION (local_dt, site_name) \
SELECT * FROM cwf')
fact_final.write.csv('s3://bucket_1/yahoo')
Before the last line, fact_final is just a "lazy" DataFrame object that contains only definitions. It does not contain any data, but it does hold pointers to the exact data files where the data is actually stored.
When you perform an actual operation (whether that is writing to S3 or running something like fact_final.count()), you get the error above. It looks like the partition local_dt=2018-05-10 no longer exists in its original form: the files or folder behind it are gone.
You can try re-initializing the DataFrame before the final write (that is another lazy operation; in your case all the work is done when you write to S3).
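A sketch of the ordering the asker reported as working: run the S3 export first, while the partition files the DataFrame points at still exist, and overwrite the table afterwards. The function name export_then_overwrite and its parameters are hypothetical; spark is assumed to be the active SparkSession and fact_final the DataFrame built in the question.

```python
def export_then_overwrite(spark, fact_final, export_path):
    """Write the export before replacing the files the DataFrame reads from."""
    # Action 1: executes the lazy plan while the original partition files exist.
    fact_final.write.csv(export_path)

    # Action 2: only now overwrite the partitions of the source table.
    fact_final.coalesce(2).createOrReplaceTempView("cwf")
    spark.sql(
        "INSERT OVERWRITE TABLE dm.web_fact_tbl "
        "PARTITION (local_dt, site_name) SELECT * FROM cwf"
    )
```

If the overwrite has to come first, the alternative described above is to rebuild fact_final from spark.table(...) after the overwrite, so that it points at the new files.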

Unable to infer schema when loading Parquet file

response = "mi_or_chd_5"
outcome = sqlc.sql("""select eid,{response} as response
from outcomes
where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite") # Success
print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
But then:
outcome2 = sqlc.read.parquet(response) # fail
fails with:
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
in
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
The documentation for parquet says the format is self describing, and the full schema was available when the parquet file was saved. What gives?
Using Spark 2.1.1. Also fails in 2.2.0.
Found this bug report, but it was fixed in 2.0.1, 2.1.0.
UPDATE: This works when connected with master="local", and fails when connected to master="mysparkcluster".
This error usually occurs when you try to read an empty directory as parquet.
Probably your outcome Dataframe is empty.
You could check if the DataFrame is empty with outcome.rdd.isEmpty() before writing it.
In my case, the error occurred because I was trying to read a Parquet file whose name started with an underscore (e.g. _lots_of_data.parquet). Not sure why this was an issue, but removing the leading underscore solved the problem.
See also:
Re: Spark-2.0.0 fails reading a parquet dataset generated by Spark-1.6.2
I'm using AWS Glue and I received this error while reading data from a Data Catalog table (location: S3 bucket).
After a bit of analysis I realised that this was because the file was not available at the file location (in my case, the S3 bucket path).
Glue was trying to apply the Data Catalog table schema to a file that didn't exist.
After copying the file into the S3 bucket location, the issue was resolved.
Hope this helps someone who encounters this error in AWS Glue.
This occurs when you try to read a table that is empty. If data had been inserted into the table correctly, there would be no problem.
Besides Parquet, the same thing happens with ORC.
I see there are already many answers, but the issue I faced was that my Spark job was trying to read a file that was being overwritten by another Spark job started earlier. It sounds bad, but I made that mistake.
Just to emphasize @Davos's answer in a comment: you will encounter this exact exception if your file name starts with a dot (.) or an underscore (_).
val df = spark.read.format("csv").option("delimiter", "|").option("header", "false")
.load("/Users/myuser/_HEADER_0")
org.apache.spark.sql.AnalysisException:
Unable to infer schema for CSV. It must be specified manually.;
The solution is to rename the file and try again (e.g. rename _HEADER to HEADER).
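As far as I know, Spark's file sources treat names that begin with _ or . as hidden or metadata files and skip them, which is why a directory holding only such files looks empty to the reader. A small sketch for checking what would be left, assuming the directory is reachable on the local filesystem; the helper name visible_data_files is made up for illustration.

```python
import os


def visible_data_files(directory: str) -> list:
    """List regular files in directory whose names do not start with '_' or '.'."""
    return sorted(
        name
        for name in os.listdir(directory)
        if not name.startswith(("_", "."))
        and os.path.isfile(os.path.join(directory, name))
    )
```

An empty result for the path you pass to spark.read would be consistent with the "Unable to infer schema" error.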
This happened to me with a Parquet file that was still being written.
I just needed to wait for it to be completely written.
In my case, the error occurred because the filename contained underscores. Rewriting and reading the file without underscores (hyphens were OK) solved the problem.
For me this happened when I thought I was loading the correct file path but was instead pointing to an incorrect folder.
I ran into a similar problem with reading a csv
spark.read.csv("s3a://bucket/spark/csv_dir/.")
gave an error of:
org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.;
I found that if I remove the trailing . it works, i.e.:
spark.read.csv("s3a://bucket/spark/csv_dir/")
I tested this for Parquet by adding a trailing . and got this error:
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
I just encountered the same problem, but none of the solutions here worked for me. I was trying to merge the row groups of my Parquet files on HDFS by first reading them and then writing them to another place using:
df = spark.read.parquet('somewhere')
df.write.parquet('somewhere else')
But later when I query it with
spark.sql('SELECT sth FROM parquet.`hdfs://host:port/parquetfolder/` WHERE .. ')
it shows the same problem.
I finally solved this by using pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq
df = spark.read.parquet('somewhere')
pdf = df.toPandas()  # collects the whole dataset onto the driver
adf = pa.Table.from_pandas(pdf)
fs = pa.hdfs.connect()  # legacy HDFS API, deprecated in newer pyarrow releases
fw = fs.open(path, 'wb')  # path: the target file on HDFS
pq.write_table(adf, fw)
fw.close()
Check whether .parquet files are available at the response path. I assume that either the files don't exist or they sit in nested (partitioned) folders. If the files are several folder levels down, append /* for each level.
In my case the .parquet files were three folders below base_path, so I gave the path as base_path/*/*/*.
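The rule above (one /* per folder level) can be sketched as a tiny helper. The name glob_for_depth is hypothetical, and depth is the number of folder levels between the base path and the files.

```python
def glob_for_depth(base_path: str, depth: int) -> str:
    """Append one '/*' per folder level between base_path and the data files."""
    return base_path.rstrip("/") + "/*" * depth
```

The result would then be passed to the reader, for example spark.read.parquet(glob_for_depth(base_path, 3)).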
As others mentioned, in my case this error appeared when I was reading S3 keys that did not exist.
A solution is to keep only the keys that do exist:
import com.amazonaws.services.s3.AmazonS3URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import java.net.URI
def addEndpointToUrl(url: String, domain: String = "s3.amazonaws.com"): String = {
  val uri = new URI(url)
  val hostWithEndpoint = uri.getHost + "." + domain
  new URI(uri.getScheme, uri.getUserInfo, hostWithEndpoint, uri.getPort, uri.getPath, uri.getQuery, uri.getFragment).toString
}
def createS3URI(url: String): AmazonS3URI = {
  try {
    // try to instantiate AmazonS3URI with url
    new AmazonS3URI(url)
  } catch {
    case e: IllegalArgumentException if e.getMessage.
      startsWith("Invalid S3 URI: hostname does not appear to be a valid S3 endpoint") => {
      new AmazonS3URI(addEndpointToUrl(url))
    }
  }
}
def s3FileExists(spark: SparkSession, url: String): Boolean = {
  val amazonS3Uri: AmazonS3URI = createS3URI(url)
  val s3BucketUri = new URI(s"${amazonS3Uri.getURI().getScheme}://${amazonS3Uri.getBucket}")
  FileSystem
    .get(s3BucketUri, spark.sparkContext.hadoopConfiguration)
    .exists(new Path(url))
}
and you can use it as:
val partitions = List(yesterday, today, tomorrow)
  .map(f => somepath + "/date=" + f)
  .filter(f => s3FileExists(spark, f))
val df = spark.read.parquet(partitions: _*)
For that solution I took some code from the spark-redshift project.
You are just loading a Parquet file. Of course the Parquet file had a valid schema; otherwise it would not have been saved as Parquet. This error means one of two things:
Either the Parquet file does not exist. (In 99.99% of cases this is the issue. Spark error messages are often less than obvious.)
Or the Parquet file somehow got corrupted, or it's not a Parquet file at all.
I ran into this issue because of a folder-in-folder problem.
For example, folderA.parquet was supposed to contain the partitions, but instead it contained folderB.parquet, which in turn contained the partitions.
Resolution: move the files up to the parent folder and delete the subfolder.
You can read with /*:
outcome2 = sqlc.read.parquet(f"{response}/*") # works for me
My problem was that I had moved the files to another location without changing the paths stored in the files in the _spark_metadata folder.
To fix it, do something like:
sed -i "s|/old/path|/new/path|g" ** .*
This issue seems to have many possible causes, so here is another scenario:
By default, the Spark Parquet source uses partition inference, which means it expects the file paths to be partitioned into key=value pairs and the load to happen at the root.
To avoid this, if we are sure all the leaf files have an identical schema, we can use
df = spark.read.format("parquet")\
    .option("recursiveFileLookup", "true")\
    .load(root_path)  # root_path is a placeholder for the top-level directory
to disable partition inference manually. It basically loads the files one by one, using Parquet's logical types.

Resources