Spark magic output committer settings not recognized - apache-spark

I'm trying to experiment with different Spark output committer settings for S3, and wanted to try out the magic committer. So far I haven't managed to get my jobs to use the magic committer; they always seem to fall back on the file output committer.
The Spark job I'm running is a simple PySpark test that runs a query, repartitions the data and writes Parquet to S3:
df = spark.sql("select * from some_table where some_condition")
df.write \
    .partitionBy("some_column") \
    .parquet("s3://some-bucket/some-folder", mode="overwrite")
The relevant spark settings are (taken from the Spark UI, job's environment tab):
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.magic.enabled true
spark.hadoop.fs.s3a.committer.name magic
spark.hadoop.fs.s3a.committer.staging.tmp.path tmp/staging
spark.hadoop.fs.s3a.committer.staging.unique-filenames true
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
mapreduce.output.fileoutputformat.compress false
mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.DefaultCodec
mapreduce.output.fileoutputformat.compress.type RECORD
mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
mapreduce.fileoutputcommitter.algorithm.version 1
mapreduce.fileoutputcommitter.task.cleanup.enabled false
mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
Hadoop properties:
fs.s3a.committer.magic.enabled true
fs.s3a.committer.name magic
(Let me know if any other settings are relevant)
I'm basing the observation that the file committer is being used instead of the magic committer on a couple of things:
Log lines produced by the Spark job seem to indicate that the file output committer is being used:
"class":"org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter","file_line":"FileOutputCommitter.java:601","func":"commitTask","message":"Saved output of task 'attempt_2021...' to s3://some-bucket/some-folder/_temporary/0/
task_2021..."
"class":"org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat","file_line":"ParquetFileFormat.scala:54","message":"U
sing user defined output committer for Parquet: org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter"
"class":"org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter","file_line":"FileOutputCommitter.java:141","func":"<init>","message":"File Outpu
t Committer Algorithm version is 1"
"class":"org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter","file_line":"FileOutputCommitter.java:156","func":"<init>","message":"FileOutput
Committer skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false"
When setting the file committer's algorithm version to an invalid number, like so:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version -7
an exception is raised from the file committer's constructor saying the value is invalid - implying that the file committer was initialized instead of the magic committer.
I'm not seeing any logs indicating usage of the magic committer, or any failure to initialize a committer which could explain falling back on the file committer.
Spark version is 3.1.2 using this spark-hadoop-cloud JAR. Let me know if there's any other officially published JAR I can try or if there are any other log indications that may be relevant.
Any thoughts?
===== EDIT:
Below is the stack trace I see when setting the file committer algorithm version to an invalid value. It seems that the call to org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.setupCommitter ends up calling org.apache.hadoop.mapreduce.lib.output.FileOutputCommitterFactory.createOutputCommitter, which in turn initializes the incorrect type org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter instead of the configured type org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter.
Py4JJavaError: An error occurred while calling o259.parquet.
: java.io.IOException: Only 1 or 2 algorithm version is supported
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:143)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.<init>(FileOutputCommitter.java:117)
at org.apache.hadoop.mapreduce.lib.output.PathOutputCommitterFactory.createFileOutputCommitter(PathOutputCommitterFactory.java:134)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitterFactory.createOutputCommitter(FileOutputCommitterFactory.java:35)
at org.apache.hadoop.mapreduce.lib.output.PathOutputCommitterFactory.createCommitter(PathOutputCommitterFactory.java:201)
at org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.setupCommitter(PathOutputCommitProtocol.scala:88)
at org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.setupCommitter(PathOutputCommitProtocol.scala:49)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.setupJob(HadoopMapReduceCommitProtocol.scala:177)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:874)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)

Mystery solved - the failure to initialize the magic committer was due to a mismatch between the committer factory scheme setting and the scheme of the actual destination URL. Consider this:
The committer factory configuration was set using the key spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a - meaning that the setting applies to s3a URLs.
While the URL passed to the write method was s3://some-bucket/some-folder - using the s3 scheme instead of s3a.
The Hadoop PathOutputCommitterFactory class looks up a config key matching the pattern mapreduce.outputcommitter.factory.scheme.%s to decide which factory to use for the given output URL. If the scheme set in the config key (in this case s3a) does not match the scheme in the destination URL (in this case s3), the committer factory setting will not be recognized and the factory will fall back on FileOutputCommitter.
Solution - make sure the outputcommitter.factory.scheme.<scheme> setting matches the scheme in the destination URL. I've successfully tested using both s3 and s3a in the URL & config key.
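To illustrate, here is a minimal PySpark sketch of the fix (bucket, table and column names are placeholders): the suffix of the factory key and the scheme of the destination URL must match, here both s3a.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Factory registered for the "s3a" scheme...
    .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
            "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)

df = spark.sql("select * from some_table where some_condition")
# ...so the destination must also use "s3a"; an "s3://" URL would silently
# fall back to the classic FileOutputCommitter, as described above.
df.write.partitionBy("some_column").parquet("s3a://some-bucket/some-folder", mode="overwrite")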

This does sound like a binding problem, but I cannot see immediately where it is. At a glance you have all the right settings.
The easiest way to check whether an S3A committer is being used is to look at the _SUCCESS file. If it is a piece of JSON then a new committer was used; the text inside will then tell you more about the committer.
A 0-byte file means that the classic file output committer was still used.
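A minimal PySpark sketch of that check (the output path is a placeholder, and it assumes S3 credentials are already configured on the session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path: the _SUCCESS marker written next to the job's output.
success_path = "s3a://some-bucket/some-folder/_SUCCESS"

content = spark.sparkContext.wholeTextFiles(success_path).collect()[0][1]
# JSON content means an S3A committer wrote the output;
# an empty file means the classic FileOutputCommitter was used.
print(content if content.strip() else "<empty _SUCCESS: classic FileOutputCommitter>")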

Related

add path for pipelinemodel in spark

I want to add a path for PipelineModel in Spark to load a model from my local file system, but it returns the following exception.
import org.apache.spark.ml.PipelineModel
val pipeline = PipelineModel.load("C:/Users/meh/Desktop/PARC_ACTIF_OM/Partie1_OM/Models_save")
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/C:/Users/meh/Desktop/PARC_ACTIF_OM/Partie1_OM/Models_save/model_final.sav/metadata
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
If C:/Users/meh/Desktop/PARC_ACTIF_OM/Partie1_OM/Models_save/model_final.sav/metadata does not exist, then you likely have a format issue: the format you saved the model in is not the format load is looking for. (The error clearly calls out that the metadata folder is missing, and that folder is critical to the load.)
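For reference, a small hypothetical PySpark sketch (paths and stages are placeholders) showing the save/load pairing that produces the metadata folder load expects:

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0)], ["a", "b"])

# Fit and save with the ML writer; this creates the metadata/ and stages/ folders.
model = Pipeline(stages=[VectorAssembler(inputCols=["a", "b"], outputCol="features")]).fit(df)
model.write().overwrite().save("/tmp/models_save/model_final")

# Load from the same directory; a model persisted any other way (e.g. pickled
# to a .sav file) will not have the metadata folder and load will fail.
reloaded = PipelineModel.load("/tmp/models_save/model_final")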

How To Get Local Spark on AWS to Write to S3

I have installed Spark 2.4.3 with Hadoop 3.2 on an AWS EC2 instance. I've been using Spark (mainly PySpark) in local mode with great success. It is nice to be able to spin up something small and then resize it when I need power, and to do it all very quickly. When I really need to scale I can switch to EMR and go to lunch. It all works smoothly apart from one issue: I can't get local Spark to reliably write to S3 (I've been using local EBS space instead). This is clearly something to do with the issues outlined in the docs about S3's limitations as a file system. However, using the latest Hadoop, my reading of the docs is that I should be able to get it working.
Note that I'm aware of this other post, which asks a related question; there is some guidance there, but no solution that I can see: How to use new Hadoop parquet magic commiter to custom S3 server with Spark
I have the following settings (set in various places), following my best understanding of the documentation here: https://hadoop.apache.org/docs/r3.2.1/hadoop-aws/tools/hadoop-aws/index.html
fs.s3.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3a.committer.name: directory
fs.s3a.committer.magic.enabled: false
fs.s3a.committer.threads: 8
fs.s3a.committer.staging.tmp.path: /cache/staging
fs.s3a.committer.staging.unique-filenames: true
fs.s3a.committer.staging.conflict-mode: fail
fs.s3a.committer.staging.abort.pending.uploads: true
mapreduce.outputcommitter.factory.scheme.s3a: org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
fs.s3a.connection.maximum: 200
fs.s3a.fast.upload: true
A relevant point is that I’m saving using parquet. I see that there was some problem with the Parquet saving previously, but I don’t see this mentioned in the latest docs. Maybe this is the problem?
In any case, here is the error I'm getting, which seems indicative of the kind of error S3 gives when trying to rename the temporary folder. Is there some combination of settings that will make this go away?
java.io.IOException: Failed to rename S3AFileStatus{path=s3://my-research-lab-recognise/spark-testing/v2/nz/raw/bank/_temporary/0/_temporary/attempt_20190910022011_0004_m_000118_248/part-00118-c8f8259f-a727-4e19-8ee2-d6962020c819-c000.snappy.parquet; isDirectory=false; length=185052; replication=1; blocksize=33554432; modification_time=1568082036000; access_time=0; owner=brett; group=brett; permission=rw-rw-rw-; isSymlink=false; hasAcl=false; isEncrypted=false; isErasureCoded=false} isEmptyDirectory=FALSE to s3://my-research-lab-recognise/spark-testing/v2/nz/raw/bank/part-00118-c8f8259f-a727-4e19-8ee2-d6962020c819-c000.snappy.parquet
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:473)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:486)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:597)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:560)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.performCommit$1(SparkHadoopMapRedUtil.scala:50)
at org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:77)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitTask(HadoopMapReduceCommitProtocol.scala:225)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:78)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
... 10 more
I helped @brettc with his configuration and we found the correct settings.
Under $SPARK_HOME/conf/spark-defaults.conf
# Enable the S3A file system to be recognised
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
# Parameters to use the new committers
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
spark.hadoop.fs.s3a.committer.name directory
spark.hadoop.fs.s3a.committer.magic.enabled false
spark.hadoop.fs.s3a.committer.staging.conflict-mode replace
spark.hadoop.fs.s3a.committer.staging.unique-filenames true
spark.hadoop.fs.s3a.committer.staging.abort.pending.uploads true
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
If you look at the last two configuration lines above, you can see that you need the classes org.apache.spark.internal.io.cloud.PathOutputCommitProtocol and org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter. To get them you have to download the spark-hadoop-cloud jar here (in our case we took version 2.3.2.3.1.0.6-1) and place it under $SPARK_HOME/jars/.
You can easily verify that you are using the new committer by creating a parquet file. The _SUCCESS file should contain JSON like the one below:
{
  "name" : "org.apache.hadoop.fs.s3a.commit.files.SuccessData/1",
  "timestamp" : 1574729145842,
  "date" : "Tue Nov 26 00:45:45 UTC 2019",
  "hostname" : "<hostname>",
  "committer" : "directory",
  "description" : "Task committer attempt_20191125234709_0000_m_000000_0",
  "metrics" : { [...] },
  "diagnostics" : { [...] },
  "filenames" : [...]
}

disabling _spark_metadata in Structured streaming in spark 2.3.0

My Structured Streaming application is writing to Parquet and I want to get rid of the _spark_metadata folder it's creating. I used the property below and it seems fine:
--conf "spark.hadoop.parquet.enable.summary-metadata=false"
When the application starts, no _spark_metadata folder is generated. But once it moves to RUNNING status and starts processing messages, it fails with the error below saying the _spark_metadata folder doesn't exist. It seems Structured Streaming relies on this folder and can't run without it. I'm just wondering whether disabling the metadata property makes any sense in this context. Is it a bug that the stream does not respect the conf?
Caused by: java.io.FileNotFoundException: File /_spark_metadata does not exist.
at org.apache.hadoop.fs.Hdfs.listStatus(Hdfs.java:261)
at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1765)
at org.apache.hadoop.fs.FileContext$Util$1.next(FileContext.java:1761)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1761)
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1726)
at org.apache.hadoop.fs.FileContext$Util.listStatus(FileContext.java:1685)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog$FileContextManager.list(HDFSMetadataLog.scala:370)
at org.apache.spark.sql.execution.streaming.HDFSMetadataLog.getLatest(HDFSMetadataLog.scala:231)
at org.apache.spark.sql.execution.streaming.FileStreamSink.addBatch(FileStreamSink.scala:99)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:477)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:475)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
The reason this was happening is that the Kafka checkpoint folder was not cleaned up. The files inside the Kafka checkpoint were cross-referencing the Spark metadata files and failing. Once I removed both, it started working.

Nifi java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations

I am following this link to set up NiFi PutHDFS to write to Azure Data Lake: Connecting to Azure Data Lake from a NiFi dataflow
NiFi is running in an HDF 3.1 VM and the NiFi version is 1.5.
We got the jar files mentioned in the above link from an HDInsight (v3.6, which supports Hadoop 2.7) head node; these jars are:
adls2-oauth2-token-provider-1.0.jar
azure-data-lake-store-sdk-2.1.4.jar
hadoop-azure-datalake.jar
jackson-core-2.2.3.jar
okhttp-2.4.0.jar
okio-1.4.0.jar
They are copied to the folder /usr/lib/hdinsight-datalake on the HDF cluster's NiFi host (we only have one host in the cluster), and the PutHDFS config is as attached (exactly as in the link above): putHDFS attributes.
But in the NiFi log we are getting this:
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()V
at org.apache.hadoop.fs.adl.AdlConfKeys.addDeprecatedKeys(AdlConfKeys.java:112)
at org.apache.hadoop.fs.adl.AdlFileSystem.(AdlFileSystem.java:92)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor$ExtendedConfiguration.getClassByNameOrNull(AbstractHadoopProcessor.java:490)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:172)
at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor$1.run(AbstractHadoopProcessor.java:322)
at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor$1.run(AbstractHadoopProcessor.java:319)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor.getFileSystemAsUser(AbstractHadoopProcessor.java:319)
at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor.resetHDFSResources(AbstractHadoopProcessor.java:281)
at org.apache.nifi.processors.hadoop.AbstractHadoopProcessor.abstractOnScheduled(AbstractHadoopProcessor.java:205)
... 16 common frames omitted
The AdlConfKeys class is from the hadoop-azure-datalake.jar file above. From the above exception, it seems to me that AdlConfKeys is loading an older version of the org.apache.hadoop.conf.Configuration class, one that does not have the reloadExistingConfigurations method. However, we cannot find out where this older class gets loaded from. This HDF 3.1 has hadoop-common-XXXX.jar in multiple locations; all those on version 2.7.x have an org.apache.hadoop.conf.Configuration containing the reloadExistingConfigurations method, and only those on version 2.3 lack it. (I decompiled both the 2.7 and 2.3 jars to check.)
[root@NifiHost /]# find . -name "*hadoop-common*"
(the output is a lot more than below, but I removed some for display purposes; most of them are on 2.7, only two of them are on version 2.3):
./var/lib/nifi/work/nar/extensions/nifi-hadoop-libraries-nar-1.5.0.3.1.0.0-564.nar-unpacked/META-INF/bundled-dependencies/hadoop-common-2.7.3.jar
./var/lib/ambari-agent/cred/lib/hadoop-common-2.7.3.jar
./var/lib/ambari-server/resources.backup/views/work/WORKFLOW_MANAGER{1.0.0}/WEB-INF/lib/hadoop-common-2.7.3.2.6.2.0-205.jar
./var/lib/ambari-server/resources.backup/views/work/HUETOAMBARI_MIGRATION{1.0.0}/WEB-INF/lib/hadoop-common-2.3.0.jar
./var/lib/ambari-server/resources/views/work/HUETOAMBARI_MIGRATION{1.0.0}/WEB-INF/lib/hadoop-common-2.3.0.jar
./var/lib/ambari-server/resources/views/work/HIVE{1.5.0}/WEB-INF/lib/hadoop-common-2.7.3.2.6.4.0-91.jar
./var/lib/ambari-server/resources/views/work/CAPACITY-SCHEDULER{1.0.0}/WEB-INF/lib/hadoop-common-2.7.3.2.6.4.0-91.jar
./var/lib/ambari-server/resources/views/work/TEZ{0.7.0.2.6.2.0-205}/WEB-INF/lib/hadoop-common-2.7.3.2.6.2.0-205.jar
./usr/lib/ambari-server/hadoop-common-2.7.2.jar
./usr/hdf/3.1.0.0-564/nifi/ext/ranger/install/lib/hadoop-common-2.7.3.jar
./usr/hdf/3.0.2.0-76/nifi/ext/ranger/install/lib/hadoop-common-2.7.3.jar
So I really don't know how NiFi managed to find a hadoop-common jar file, or something else, containing a Configuration class that does not have the reloadExistingConfigurations() method. We do not have any customized NAR files deployed to NiFi either; everything is pretty much the default of what HDF 3.1 ships with NiFi.
Please advise. I've spent a whole day on this but can't fix the issue. I appreciate your help.
I think the Azure JARs you are using require a newer version of hadoop-common than the 2.7.3 one that NiFi is using.
If you look at the Configuration class from 2.7.3, there is no reloadExistingConfigurations method:
https://github.com/apache/hadoop/blob/release-2.7.3-RC2/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java
It appears to have been introduced sometime during 2.8.x:
https://github.com/apache/hadoop/blob/release-2.8.3-RC0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java

Can't read files via FTP using SparkContext.textFile(...) on Google Dataproc

I am running a Spark cluster on Google Dataproc and I'm experiencing some issues while trying to read a GZipped file from FTP using sparkContext.textFile(...).
The code I am running is:
object SparkFtpTest extends App {
  val file = "ftp://username:password@host:21/filename.txt.gz"
  val lines = sc.textFile(file)
  lines.saveAsTextFile("gs://my-bucket-storage/tmp123")
}
The error that I get is:
Exception in thread "main" org.apache.commons.net.ftp.FTPConnectionClosedException: Connection closed without indication.
I see some people have suggested that the credentials are wrong, so I've tried entering wrong credentials and the error was different, i.e. Invalid login credentials.
It also works if I copy the URL into the browser - the file is being downloaded properly.
It's also worth mentioning that I've tried using the Apache commons-net library (the same version as the one in Spark - 2.2) and it worked - I was able to stream the data (from both master and worker nodes). I wasn't able to decompress it, though (using Java's GZIPInputStream; I can't remember the failure, but if you think it's important I can try to reproduce it). I think this suggests it's not some firewall issue on the cluster, though I wasn't able to use curl to download the file.
I think I was running the same code a few months ago from my local machine and if I remember correctly it worked just fine.
Do you have any ideas what is causing this problem?
Could it be that it's some kind of dependency conflict problem and if so which one?
I have a couple of dependencies in the project such as google-sdk, solrj, ... However, I'd expect to see something like ClassNotFoundException or NoSuchMethodError if it was a dependency problem.
The whole stack trace looks like this:
16/12/05 23:53:46 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/
16/12/05 23:53:47 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/ - removing from cache
16/12/05 23:53:49 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/0/
16/12/05 23:53:50 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/0/ - removing from cache
16/12/05 23:53:50 INFO com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Populating missing itemInfo on-demand for entry: gs://my-bucket-storage/tmp123/_temporary/
16/12/05 23:53:51 WARN com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage: Possible stale CacheEntry; failed to fetch item info for: gs://my-bucket-storage/tmp123/_temporary/ - removing from cache
Exception in thread "main" org.apache.commons.net.ftp.FTPConnectionClosedException: Connection closed without indication.
at org.apache.commons.net.ftp.FTP.__getReply(FTP.java:298)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:495)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:537)
at org.apache.commons.net.ftp.FTP.sendCommand(FTP.java:586)
at org.apache.commons.net.ftp.FTP.quit(FTP.java:794)
at org.apache.commons.net.ftp.FTPClient.logout(FTPClient.java:788)
at org.apache.hadoop.fs.ftp.FTPFileSystem.disconnect(FTPFileSystem.java:151)
at org.apache.hadoop.fs.ftp.FTPFileSystem.getFileStatus(FTPFileSystem.java:395)
at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1701)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1647)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:222)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1906)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1219)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1161)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1161)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1064)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1030)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1030)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1030)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:956)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:956)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:956)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:955)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1459)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1438)
at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1438)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1438)
It looks like this may be a known unresolved issue in Spark/Hadoop: https://issues.apache.org/jira/browse/HADOOP-11886 and https://github.com/databricks/learning-spark/issues/21 both allude to a similar stack trace.
If you were able to manually use the Apache commons-net library, you could achieve the same effect as sc.textFile by obtaining the list of files, parallelizing that list as an RDD, and using flatMap where each task takes a filename, reads the file line by line, and generates the output collection of lines for that file; a rough sketch of this follows.
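Here is one possible PySpark version of that idea (host, credentials, and file names are placeholders; each task opens its own FTP connection and yields decoded lines):

import ftplib
import gzip
import io

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Placeholders: FTP endpoint, credentials, and the file list gathered up front.
HOST, USER, PASSWORD = "host", "username", "password"
filenames = ["filename1.txt.gz", "filename2.txt.gz"]

def read_ftp_gz(name):
    # Download one gzipped file over FTP into memory and yield its decoded lines.
    ftp = ftplib.FTP(HOST)
    ftp.login(USER, PASSWORD)
    buf = io.BytesIO()
    ftp.retrbinary("RETR " + name, buf.write)
    ftp.quit()
    buf.seek(0)
    with gzip.GzipFile(fileobj=buf) as gz:
        for line in io.TextIOWrapper(gz, encoding="utf-8"):
            yield line.rstrip("\n")

lines = sc.parallelize(filenames, numSlices=len(filenames)).flatMap(read_ftp_gz)
lines.saveAsTextFile("gs://my-bucket-storage/tmp123")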
Alternatively, if the amount of data you have on FTP is small (up to maybe 10 GB or so), then parallel reads won't help much compared to a single thread copying from your FTP server onto HDFS or GCS in your Dataproc cluster, and then processing that HDFS or GCS path in your Spark job.

Resources