How To Get Local Spark on AWS to Write to S3 - apache-spark

I have installed Spark 2.4.3 with Hadoop 3.2 on an AWS EC2 instance. I've been using Spark (mainly pyspark) in local mode with great success. It is nice to be able to spin up something small and then resize it when I need power, and to do it all very quickly. When I really need to scale I can switch to EMR and go to lunch. It all works smoothly apart from one issue: I can't get the local Spark to reliably write to S3 (I've been using local EBS space instead). This is clearly something to do with all the issues outlined in the docs about S3's limitations as a file system. However, using the latest Hadoop, my reading of the docs is that I should be able to get it working.
Note that I'm aware of this other post, which asks a related question; there is some guidance there, but no solution that I can see: How to use new Hadoop parquet magic commiter to custom S3 server with Spark
I have the following settings (set in various places), following my best understanding of the documentation here: https://hadoop.apache.org/docs/r3.2.1/hadoop-aws/tools/hadoop-aws/index.html
fs.s3.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3a.committer.name: directory
fs.s3a.committer.magic.enabled: false
fs.s3a.committer.threads: 8
fs.s3a.committer.staging.tmp.path: /cache/staging
fs.s3a.committer.staging.unique-filenames: true
fs.s3a.committer.staging.conflict-mode: fail
fs.s3a.committer.staging.abort.pending.uploads: true
mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
fs.s3a.connection.maximum: 200
fs.s3a.fast.upload: true
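For concreteness, here is a sketch of how the same options can be applied when building the SparkSession from PySpark (the spark.hadoop. prefix passes them through to the Hadoop configuration; the app name is just illustrative):
from pyspark.sql import SparkSession

# Sketch: mirror the settings listed above when creating the session
spark = (
    SparkSession.builder
    .appName("local-s3a-write")  # illustrative name
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "false")
    .config("spark.hadoop.fs.s3a.committer.threads", "8")
    .config("spark.hadoop.fs.s3a.committer.staging.tmp.path", "/cache/staging")
    .config("spark.hadoop.fs.s3a.committer.staging.unique-filenames", "true")
    .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "fail")
    .config("spark.hadoop.fs.s3a.committer.staging.abort.pending.uploads", "true")
    .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
            "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    .getOrCreate()
)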
A relevant point is that I'm saving as Parquet. I see that there were some problems with Parquet saving previously, but I don't see them mentioned in the latest docs. Maybe this is the problem?
In any case, here is the error I'm getting, which seems indicative of the kind of error S3 gives when trying to rename the temporary folder. Is there some combination of settings that will make this go away?
java.io.IOException: Failed to rename S3AFileStatus{path=s3://my-research-lab-recognise/spark-testing/v2/nz/raw/bank/_temporary/0/_temporary/attempt_20190910022011_0004_m_000118_248/part-00118-c8f8259f-a727-4e19-8ee2-d6962020c819-c000.snappy.parquet; isDirectory=false; length=185052; replication=1; blocksize=33554432; modification_time=1568082036000; access_time=0; owner=brett; group=brett; permission=rw-rw-rw-; isSymlink=false; hasAcl=false; isEncrypted=false; isErasureCoded=false} isEmptyDirectory=FALSE to s3://my-research-lab-recognise/spark-testing/v2/nz/raw/bank/part-00118-c8f8259f-a727-4e19-8ee2-d6962020c819-c000.snappy.parquet
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:473)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:486)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:597)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:560)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:77)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:225)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:78)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
... 10 more

I helped @brettc with his configuration and we found the correct settings to use.
Under $SPARK_HOME/conf/spark-defaults.conf
# Enable the S3A file system to be recognised
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
# Parameters to use the new committers
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.fs.s3a.committer.name directory
spark.hadoop.fs.s3a.committer.magic.enabled false
spark.hadoop.fs.s3a.committer.staging.conflict-mode replace
spark.hadoop.fs.s3a.committer.staging.unique-filenames true
spark.hadoop.fs.s3a.committer.staging.abort.pending.uploads true
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
If you look at the last two configuration lines above, you will see that you need the org.apache.spark.internal.io library, which contains the PathOutputCommitProtocol and BindingParquetOutputCommitter classes. To get them you have to download the spark-hadoop-cloud jar (in our case we took version 2.3.2.3.1.0.6-1) and place it under $SPARK_HOME/jars/.
You can easily verify that you are using the new committer by creating a parquet file. The _SUCCESS file should contain JSON like the example below:
{
"name" : "org.apache.hadoop.fs.s3a.commit.files.SuccessData/1",
"timestamp" : 1574729145842,
"date" : "Tue Nov 26 00:45:45 UTC 2019",
"hostname" : "<hostname>",
"committer" : "directory",
"description" : "Task committer attempt_20191125234709_0000_m_000000_0",
"metrics" : { [...] },
"diagnostics" : { [...] },
"filenames" : [...]
}
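For example, a quick check from PySpark (the bucket and path below are illustrative, not from the original setup): write a small DataFrame as Parquet, then read the _SUCCESS object back and inspect the committer name.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

out = "s3a://my-bucket/committer-check"  # illustrative path
spark.range(10).write.mode("overwrite").parquet(out)

# With the new committers, _SUCCESS is a JSON document rather than a zero-byte file
success = spark.sparkContext.wholeTextFiles(out + "/_SUCCESS").collect()[0][1]
print(json.loads(success)["committer"])  # expect "directory"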

Related

Spark overwrite parquet files on aws s3 raise URISyntaxException: Relative path in absolute URI

I am using Spark to write and read parquet files on AWS S3. I have parquet files which are stored in
's3a://mybucket/file_name.parquet/company_name=company_name/record_day=2019-01-01 00:00:00'
partitioned by 'company_name' and 'record_day'
I want to write a basic pipeline to update my parquet files on a regular basis by 'record_day'. To do this, I am going to use overwrite mode:
df.write.mode('overwrite').parquet("s3a://mybucket/file_name.parquet/company_name='company_name'/record_day='2019-01-01 00:00:00'")
But I am getting an unexpected error: 'java.net.URISyntaxException: Relative path in absolute URI: key=2019-01-01 00:00:00'.
I spent several hours searching for the problem but found no solution. For some tests, I replaced the 'overwrite' parameter with 'append', and everything worked fine. I also made a simple dataframe, and overwrite mode also works fine on it. I know that I can solve my problem in a different way, by deleting and then writing the particular part, but I would like to understand what the cause of the error is.
Spark 2.4.4 Hadoop 2.8.5
Appreciate any help.
I had the same error, and my solution was to remove the ':' part in the date.
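A minimal sketch of that workaround (the DataFrame contents and the target path are illustrative): reformat the partition value so it contains no ':' characters before writing.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data standing in for one day's update
df = spark.createDataFrame(
    [("company_name", "2019-01-01 00:00:00", 1.0)],
    ["company_name", "record_day", "value"],
)

# Replace ':' (and the space) in the partition value, e.g.
# '2019-01-01 00:00:00' -> '2019-01-01-00-00-00'
df = df.withColumn("record_day", F.regexp_replace("record_day", "[: ]", "-"))

(df.write
   .mode("overwrite")
   .partitionBy("company_name", "record_day")
   .parquet("s3a://mybucket/file_name.parquet"))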

Pyspark writing out to partitioned parquet using s3a issue

I have a pyspark script which reads in a single unpartitioned parquet file from S3, does some transformations and writes back to another S3 bucket, partitioned by date.
I'm using s3a to do the read and write. Reading in the files and performing the transformations is fine, no problem. However, when I try to write out to S3 using s3a with partitioning, it throws the following error:
WARN s3a.S3AFileSystem: Found file (with /): real file? should not
happen: folder1/output
org.apache.hadoop.fs.FileAlreadyExistsException: Can't make directory
for path 's3a://bucket1/folder1/output' since it is a file.
The part of the code I'm using to write is as follows, where I'm trying to append a new partition for a new date to the existing directory:
output_loc = "s3a://bucket1/folder1/output/"
finalDf.write.partitionBy("date", "advertiser_id") \
    .mode("append") \
    .parquet(output_loc)
I'm using Hadoop v3.0.0 and Spark 2.4.1.
Has anyone come across this issue when using s3a instead of s3n? BTW it works fine on an older instance using s3n.
Thanks
There's an entry in your bucket s3a://bucket1/folder1/output/ with the trailing slash which has size > 0. S3A is warning that it's unhappy, as that entry is treated as an empty-dir marker which is at risk of deletion once you add files underneath.
Look in the S3 bucket from the AWS console, see what is there, and delete it.
Try using output_loc without a trailing / to see if that helps (unlikely...).
Add a follow-up on the outcome; if the delete doesn't fix things then a Hadoop JIRA may be worth filing.
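A hedged sketch of that inspection and cleanup with boto3 (bucket and key are taken from the question; credentials are assumed to be configured in the environment):
import boto3

s3 = boto3.client("s3")

# Look for an object whose key is exactly the "directory" name with a trailing
# slash; that is the marker entry S3A is complaining about.
resp = s3.list_objects_v2(Bucket="bucket1", Prefix="folder1/output/")
for obj in resp.get("Contents", []):
    if obj["Key"] == "folder1/output/" and obj["Size"] > 0:
        print("Found non-empty marker object:", obj["Key"], obj["Size"])
        # Delete it so S3A no longer treats the path as a file
        s3.delete_object(Bucket="bucket1", Key=obj["Key"])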

How to save files in same directory using saveAsNewAPIHadoopFile spark scala

I am using Spark Streaming and I want to save each batch of the stream to my local disk in Avro format. I have used saveAsNewAPIHadoopFile to save the data in Avro format. This works well, but it overwrites the existing file: the next batch's data overwrites the old data. Is there any way to save the Avro files into a common directory? I tried adding some Hadoop job conf properties to add a prefix to the file name, but none of the properties worked.
dstream.foreachRDD { rdd =>
  rdd.saveAsNewAPIHadoopFile(
    path,
    classOf[AvroKey[T]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[T]],
    job.getConfiguration()
  )
}
Try this -
You can make your process split into 2 steps :
Step 1: Write the Avro files using saveAsNewAPIHadoopFile to <temp-path>
Step 2: Move the files from <temp-path> to <actual-target-path> (a rough sketch of this step follows)
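A rough sketch of the move step, assuming the batches land on the local file system and that a per-batch timestamp prefix is enough to keep names unique (the paths are placeholders):
import os
import shutil
import time

def promote_batch(temp_path, target_path):
    # Move every file written for this batch from temp_path into target_path,
    # prefixing names with a timestamp so earlier batches are never overwritten.
    os.makedirs(target_path, exist_ok=True)
    stamp = time.strftime("%Y%m%d%H%M%S")
    for name in os.listdir(temp_path):
        shutil.move(os.path.join(temp_path, name),
                    os.path.join(target_path, stamp + "-" + name))

# After saveAsNewAPIHadoopFile has written a batch to <temp-path>:
# promote_batch("/data/avro/_tmp", "/data/avro/output")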
This will definitely solve your problem for now. I will share my thoughts if I get to fulfill this scenario in one step instead of two.
Hope this is helpful.

Spark cannot access local file anymore?

I had this code running before
df = sc.wholeTextFiles('./dbs-*.json,./uob-*.json').flatMap(lambda x: flattenTransactionFile(json.loads(x[1]))).toDF()
But it appears that now, I get
Py4JJavaError: An error occurred while calling o24.partitions.
: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://localhost:9000/user/jiewmeng/dbs-*.json matches 0 files
Input Pattern hdfs://localhost:9000/user/jiewmeng/uob-*.json matches 0 files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:330)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:272)
at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55)
Looks like Spark is trying to use Hadoop (HDFS)? How can I use a local file? Also, why the sudden failure, since I managed to use ./dbs-*.json before?
By default, the location of the file is relative to your directory in HDFS. To refer to the local file system, you need to use file:///your_local_path
For example, in the Cloudera VM, if I say
sc.textFile('myfile')
it will assume the HDFS path
/user/cloudera/myfile
whereas to refer to my local home directory I would say
sc.textFile('file:///home/cloudera/myfile')
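Applied to the original call, an explicit local URI should work; the home directory below is an assumption based on the user name in the error message:
# hypothetical absolute local path, inferred from the HDFS user in the error
df = sc.wholeTextFiles(
    'file:///home/jiewmeng/dbs-*.json,file:///home/jiewmeng/uob-*.json'
).flatMap(lambda x: flattenTransactionFile(json.loads(x[1]))).toDF()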

Spark in docker parquet error No predefined schema found

I have a local Spark test cluster, including R, based on https://github.com/gettyimages/docker-spark. In particular, this image is used: https://hub.docker.com/r/possibly/spark/
When trying to read a parquet file with SparkR, this exception occurs. Reading a parquet file works without any problems on a local Spark installation.
myData.parquet <- read.parquet(sqlContext, "/mappedFolder/myFile.parquet")
16/03/29 20:36:02 ERROR RBackendHandler: parquet on 4 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
java.lang.AssertionError: assertion failed: No predefined schema found, and no Parquet data files or summary files found under file:/mappedFolder/myFile.parquet.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$MetadataCache$$readSchema(ParquetRelation.scala:512)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache$$anonfun$12.apply(ParquetRelation.scala:421)
at scala.Option.orElse(Option.scala:257)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:421)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCac
Strangely, the error is the same - even for files that do not exist.
However in the terminal I can see that the files are there:
/mappedFolder/myFile.parquet
root@worker:/mappedFolder/myFile.parquet# ls
_common_metadata part-r-00097-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet part-r-00196-e5221f6f-e125-4f52-9f6d-4f38485787b3.gz.parquet
....
My initial parquet file seems to have been corrupted during my test runs of the dockerized Spark.
To solve it: re-create the parquet files from the original sources.
