Enabling the Directory Committer in Spark

Enabling the Directory Committer in Spark - apache-spark

I am trying to use S3A Partitioned(or Directory as I just need to confirm if committer is working as expected) committer with Spark. I am following this link based on which it should be pretty simple however I am running into new issues while resolving previous one
Code used for testing is (inside spark-shell):
val sourceDF = spark.range(0, 10000)
val datasets = "s3a://bucket-name/test"
sourceDF.write.format("orc").save(datasets + "orc")
spark-defaults.conf is:
spark.hadoop.fs.s3a.committer.name directory
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
Error 1:
scala> sourceDF.write.format("orc").save(datasets + "orc")
java.lang.NoClassDefFoundError:
org/apache/hadoop/mapreduce/lib/output/PathOutputCommitter
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
at org.apache.spark.internal.io.FileCommitProtocol$.instantiate(FileCommitProtocol.scala:144)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:98)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org .apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 81 more
Then I copied spark-hadoop-cloud_2.11-2.3.1.3.0.2.0-50.jar from this link into spark/jars folder
This resolved the previous ``NoClassDefFoundError` but produced new class def error which is:
Error 2:
java.lang.NoClassDefFoundError:
org/apache/hadoop/mapreduce/lib/output/PathOutputCommitter
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:230)
at org.apache.spark.internal.io.FileCommitProtocol$.instantiate(FileCommitProtocol.scala:144)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:98)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
....
Full stacktrace can be pasted if needed
After this, I copied hadoop-mapreduce-client-core-3.1.1.jar into spark/jars folder and ran the test code in spark-shell again. This time I got below error:
After this, I am stuck.
Error 3 (and final error where I am stuck):
scala> sourceDF.write.format("orc").save(datasets + "orc")
java.lang.NoSuchMethodError:
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.<init>(Ljava/lang/String;Ljava/lang/String;Z)V
at org.apache.spark.internal.io.cloud.PathOutputCommitProtocol.<init>(PathOutputCommitProtocol.scala:60)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.internal.io.FileCommitProtocol$.instantiate(FileCommitProtocol.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:98)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:435)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
... 48 elided
This looks like incorrect jar issue but I am not able to find the correct one.
This question is similar to previous question but could not find relevant answer hence posting again.

All the new committers config documentation I've read up to date, is missing one fundamental fact:
spark 2.x.x does not have needed support classes to make new S3a committers to function.
They promise those cloud integration libs will be bundled with spark 3.0.0, but for now you have to add libraries yourself.
Under the cloud integration maven repos there are multiple distributions supporting the committers, I found one working with the directory committer
Please also remember that the directory committer requires shared filesystem such as HDFS or NFS (we use AWS EFS) to coordinate spark worker writes to S3.
If it is not set properly, the resulting S3 write location will contain empty folder with single _SUCCESS file in it.
If you set your committer properly, the _SUCCESS file will contain the JSON status of the write. If it is empty (zero length), means you still use the default spark s3 committer.

I was able to make this work. Issue was with my spark version. I was using spark 2.2.1 version whereas atleast spark 2.3.1 is needed.
Last error mentioned in the question was pointing to wrong constructor. After digging a little bit, I found out that spark-core_2.11-2.2.1.jar was using two parameter constructor where as spark-hadoop-cloud_2.11-2.3.1.3.0.2.0-50.jar was expecting 3 parameter constructor which came only in spark-core_2.11-2.3.1.jar version. After version bump and with few tweaks, I was able to test this.
Run this command to see the issue:
javap -classpath spark-core_2.11-2.2.1.jar org/apache/spark/internal/io/HadoopMapReduceCommitProtocol
javap -classpath spark-core_2.11-2.3.1.jar org/apache/spark/internal/io/HadoopMapReduceCommitProtocol
Just to let you know, I downloaded spark 2.3.1 without pre-built hadoop version, then configured it with hadoop 3.1.0 jars and was able to make it work.

Related

Exception occured while writing delta format in AWS S3

I am using spark 3.x, java8 and delta 1.0.0 i.e. delta-core_2.12_1.0.0 in my spark job.
data is persisted in AWS S3 path in "delta" format of parquet.
Below are details of Jars I am using in my spark job.
spark-submit.sh
export SPARK_HOME=/local/apps/pkg/spark-3.0.2-bin-hadoop2.9.1-custom
--packages org.apache.spark:spark-sql_2.12:3.0.2,io.delta:delta-core_2.12:1.0.0
pom.xml
<spark.version>3.0.2</spark.version>
While saving bigger set of data job failing to write data with below error
Caused by: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at org.apache.spark.sql.delta.files.TransactionalWrite.$anonfun$writeFiles$1(TransactionalWrite.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:130)
at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:115)
at org.apache.spark.sql.delta.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:81)
at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles(TransactionalWrite.scala:108)
at org.apache.spark.sql.delta.files.TransactionalWrite.writeFiles$(TransactionalWrite.scala:107)
at org.apache.spark.sql.delta.OptimisticTransaction.writeFiles(OptimisticTransaction.scala:81)
at org.apache.spark.sql.delta.commands.WriteIntoDelta.write(WriteIntoDelta.scala:106)
at org.apache.spark.sql.delta.commands.WriteIntoDelta.$anonfun$run$1(WriteIntoDelta.scala:65)
at org.apache.spark.sql.delta.commands.WriteIntoDelta.$anonfun$run$1$adapted(WriteIntoDelta.scala:64)
at org.apache.spark.sql.delta.DeltaLog.withNewTransaction(DeltaLog.scala:188)
at org.apache.spark.sql.delta.commands.WriteIntoDelta.run(WriteIntoDelta.scala:64)
at org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:148)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:126)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:962)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:962)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:414)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:345)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:287)
at com.spgmi.ca.benchmark.datasource.DeltaDataSource.write(DeltaDataSource.java:47)
... 8 more
Caused by: org.apache.spark.SparkException: Job 67 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1(DAGScheduler.scala:979)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$cleanUpAfterSchedulerStop$1$adapted(DAGScheduler.scala:977)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:977)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:2257)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2170)
at org.apache.spark.SparkContext.$anonfun$stop$12(SparkContext.scala:1988)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1988)
at org.apache.spark.SparkContext.$anonfun$new$35(SparkContext.scala:638)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:200)
So what is wrong here ?
how to debug and fix this issue ? Any help is highly appreciated.

You're using Delta version that is incompatible with your Spark. The last version of Delta working with Spark 2.4 was version 0.6.x (0.6.2 as I remember, although I didn't check). See the versions compatibility matrix for more information.
P.S. It really makes no sense to use Spark 2.4 in 2022nd - Spark 3.0+ has a lot of optimizations compared to 2.x...

java.lang.InstantiationError: com.datastax.oss.driver.internal.core.util.collection.QueryPlan while running spark-cassandra connector

I am trying to get data using dataframes from cassandra by using spark-cassandra-connector but getting below exception.
Note: Connection is successful to cassandra.
Spark version: 2.4.1
spark-cassandra-connector version: 2.5.1
Error starting ApplicationContext. To display the conditions report re-run your application with 'debug' enabled.
2021-10-01 11:32:01.649 ERROR 17404 --- [ main] o.s.boot.SpringApplication : Application run failed
java.lang.InstantiationError: com.datastax.oss.driver.internal.core.util.collection.QueryPlan
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.newQueryPlan(LocalNodeFirstLoadBalancingPolicy.scala:122) ~[spark-cassandra-connector-driver_2.11-2.5.1.jar:2.5.1]
at com.datastax.oss.driver.internal.core.metadata.LoadBalancingPolicyWrapper.newQueryPlan(LoadBalancingPolicyWrapper.java:155) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.onThrottleReady(CqlRequestHandler.java:193) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.session.throttling.PassThroughRequestThrottler.register(PassThroughRequestThrottler.java:52) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.(CqlRequestHandler.java:171) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.cql.CqlRequestAsyncProcessor.process(CqlRequestAsyncProcessor.java:44) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:54) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:78) ~[java-driver-core-shaded-4.11.3.jar:na]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_271]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_271]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_271]
at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_271]
at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:43) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.sun.proxy.$Proxy81.execute(Unknown Source) ~[na:na]
at com.datastax.spark.connector.rdd.partitioner.dht.TokenFactory$$anonfun$1.apply(TokenFactory.scala:99) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.partitioner.dht.TokenFactory$$anonfun$1.apply(TokenFactory.scala:98) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:129) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1] at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.partitioner.dht.TokenFactory$.forSystemLocalPartitioner(TokenFactory.scala:98) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.partitioner.SplitSizeEstimator$class.tokenFactory(SplitSizeEstimator.scala:9) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tokenFactory$lzycompute(CassandraTableScanRDD.scala:64) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tokenFactory(CassandraTableScanRDD.scala:64) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.partitioner.SplitSizeEstimator$class.estimateDataSize(SplitSizeEstimator.scala:12) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.partitioner.SplitSizeEstimator$class.estimateSplitCount(SplitSizeEstimator.scala:21) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.estimateSplitCount(CassandraTableScanRDD.scala:64) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$1.apply$mcI$sp(CassandraTableScanRDD.scala:228) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$1.apply(CassandraTableScanRDD.scala:228) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$1.apply(CassandraTableScanRDD.scala:228) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at scala.Option.getOrElse(Option.scala:121) ~[scala-library-2.11.12.jar:na]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.partitionGenerator$lzycompute(CassandraTableScanRDD.scala:228) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.partitionGenerator(CassandraTableScanRDD.scala:224) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.getPartitions(CassandraTableScanRDD.scala:273) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) ~[spark-core_2.11-2.4.1.jar:2.4.1]
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) ~[spark-core_2.11-2.4.1.jar:2.4.1]
at scala.Option.getOrElse(Option.scala:121) ~[scala-library-2.11.12.jar:na]
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) ~[spark-core_2.11-2.4.1.jar:2.4.1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126) ~[spark-core_2.11-2.4.1.jar:2.4.1]
at org.apache.spark.rdd.RDD.count(RDD.scala:1168) ~[spark-core_2.11-2.4.1.jar:2.4.1]
at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:455) ~[spark-core_2.11-2.4.1.jar:2.4.1]
at org.apache.spark.api.java.AbstractJavaRDDLike.count(JavaRDDLike.scala:45) ~[spark-core_2.11-2.4.1.jar:2.4.1]

The error you posted indicates that the embedded Java driver is not able to generate a query plan -- list of Cassandra nodes to connect to as coordinators. There is possibly an issue with how you've defined the contact points.
You normally need to specify a contact point with the cassandra.connection.host parameter. Here's an example of how you would start a Spark shell using the connector:
$ spark-shell
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.1
--conf spark.cassandra.connection.host=cassandra_ip
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions
In your case, it looks like you're creating a connection from Spring Boot and you are probably running into conflicts with dependencies.
You will need to update your original question with details of your configuration including details of the dependencies plus what command you're running to connect to Spark so those answering your question have a better idea of what the problem is. Cheers!

Spark SFTP library can not download the file from sftp server when running in EMR

I am using com.springml.spark.sftp in my spark job to download the file from sftp server. The basic code is as following.
val sftpDF = spark.read.
schema(my_schema).
format("com.springml.spark.sftp").
option("host", "myhost.test.com").
option("username", "myusername").
option("password", "mypassword").
option("inferSchema", "false").
option("fileType", "csv").
option("delimiter", ",").
option("codec", "org.apache.hadoop.io.compress.GzipCodec").
load("/data/test.csv.gz")
It runs well when I run it in my local machine by using "spark-submit spark.jar". However, when I tried to run it in EMR, it shew the following errors. It seems it spark job tried to find the file in HDFS instead of SFTP.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://ip-10-61-82-166.ap-southeast-2.compute.internal:8020/tmp/test.csv.gz;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at com.springml.spark.sftp.DatasetRelation.read(DatasetRelation.scala:44)
at com.springml.spark.sftp.DatasetRelation.<init>(DatasetRelation.scala:29)
at com.springml.spark.sftp.DefaultSource.createRelation(DefaultSource.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:316)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at com.example.App$.main(App.scala:134)
at com.example.App.main(App.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
What could be wrong? Do I need register some DataSource for SFTP module?
Thanks!

Okay, I found the reason. The version of 1.1.0 of this sftp library was used. It downloads the file to the folder of driver instead of executor. It is the reason that the error was generated when I ran it in the cluster mode instead of standalone mode. I also noticed the same issue asked here. https://github.com/springml/spark-sftp/issues/24. After upgrading the version of sftp library from 1.1.0 to 1.1.3, the problem was solved.

Unable to read Hbase data with spark in yarn cluster mode

Cluster configuration:
Hadoop: CDH-6.2.1
Spark: 2.4.0
Hbase: 2.0
What I do: Read HBase data through Spark
When I use IntelliJ and local mode everything works fine, but when I change mode to
spark-submit --master yarn, the following stacktrace happens:
20/05/20 11:00:46 ERROR mapreduce.TableInputFormat: java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:221)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:114)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.initialize(TableInputFormat.java:200)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:243)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:254)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2146)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at com.song.HbaseOnSpark1$.main(HbaseOnSpark1.scala:32)
at com.song.HbaseOnSpark1.main(HbaseOnSpark1.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:673)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:219)
... 27 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.client.ConnectionImplementation.close(ConnectionImplementation.java:1938)
at org.apache.hadoop.hbase.client.ConnectionImplementation.<init>(ConnectionImplementation.java:310)
... 32 more
20/05/20 11:00:46 ERROR yarn.ApplicationMaster: User class threw exception: java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:254)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:254)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2146)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at com.song.HbaseOnSpark1$.main(HbaseOnSpark1.scala:32)
at com.song.HbaseOnSpark1.main(HbaseOnSpark1.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:673)
Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:558)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:249)
... 24 more
This is my code:
val conf: SparkConf = new SparkConf().setAppName("spark1")
val spark = new SparkContext(conf)
val hbaseConf: Configuration = HBaseConfiguration.create()
hbaseConf.set("hbase.zookeeper.quorum","hadoop01,hadoop02,hadoop03")
hbaseConf.set(TableInputFormat.INPUT_TABLE,"idx_name")
hbaseConf.set("hbase.defaults.for.version.skip", "true")
val rdd: RDD[(ImmutableBytesWritable, Result)] = spark.newAPIHadoopRDD(
hbaseConf,
classOf[TableInputFormat],
classOf[ImmutableBytesWritable],
classOf[Result]
)

its hbase classpatth issue in your cluster but you need to add hbase jars to your classpath like this
export SPARK_CLASSPATH=$SPARK_CLASSPATH:`hbase classpath`
hbase classpath will give all the jars for hbase connections and etc....
Why its working in local mode ?
Since all the jars required are there in ide lib
If you are using maven do a mvn depdency:tree to understand what jars are needed in the cluster. based on that you can adjust your spark-submit script.
if you are using --jars option see that all jars passed correctly or uber jar has correct dependencies when packing jar..
There might be jar conflict also check that carefully with local mode environment since thats working fine.
Further reading Spark spark-submit --jars arguments wants comma list, how to declare a directory of jars?

Unable to create RDD from data in HDFS

I am tying to create an RDD using the code but unable to do it. Is there any solution to this issue.
I have tried to run it with the localhost:port details. I have also tried running it with the entire path of the HDFS:/user/training/intel/NYSE.csv. Any path i am using is being serached only on the local directory but not on hdfs.
Thanks
scala> val myrdd = sc.textFile("/training/intel/NYSE.csv")
myrdd: org.apache.spark.rdd.RDD[String] = /training/intel/NYSE.csv MapPartitionsRDD[5] at textFile at <console>:24
scala> myrdd.collect
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/training/intel/NYSE.csv
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
... 48 elided
I also tried the following:
scala> val myrdd = sc.textFile("hdfs://localhost:8020/training/intel/NYSE.csv")
myrdd: org.apache.spark.rdd.RDD[String] = hdfs://localhost:8020/training/intel/NYSE.csv MapPartitionsRDD[7] at textFile at <console>:24
scala> myrdd.collect
java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).; Host Details : local host is: "hadoop/127.0.0.1"; destination host is: "localhost":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy24.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1674)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
... 48 elided
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.<init>(RpcHeaderProtos.java:2201)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.<init>(RpcHeaderProtos.java:2165)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:2295)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:2290)
at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:241)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcHeaderProtos.java:3167)
at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1086)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:979)
No matter how I run it, i am getting that the path doesn't exist.

The files are in HDFS.
Spark has been configured to read from your local filesystem file:/
You need to edit your core-site.xml file in your Spark installation directory to make sure that fs.defaultFS is set up correctly to use your Hadoop Namenode
InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).; Host Details : local host is: "hadoop/127.0.0.1"; destination host is: "localhost":8020;
This would imply that your Spark HDFS client is not compatible with the installed Hadoop server APIs or you're connecting to the wrong port
And also, the file is not at hdfs:///training/.. anyway.
Besides that, HDFS isn't required to learn Spark, so maybe try playing with hadoop fs commands first or move files to local system, depending on what your goals are

This happens due to internal mapping between directories. First go to the directory where your file (NYSE.csv) is kept. run command :
df -k
You will get the actual mount point of the directory. For example: /xyz
Now, try finding your file(NYSE.csv) within this mount point. For example: /xyz/training/intel/NYSE.csv and use this path in your code.
val myrdd = sc.textfile("/xyz/training/intel/NYSE.csv");

"/training/intel/NYSE.csv"
means "start looking from the top level directory". Either change it to "/user/training/intel/NYSE.csv" or just "training/intel/NYSE.csv" (no leading /) to reference a file relative to your current directory.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Enabling the Directory Committer in Spark - apache-spark

Related

Exception occured while writing delta format in AWS S3

java.lang.InstantiationError: com.datastax.oss.driver.internal.core.util.collection.QueryPlan while running spark-cassandra connector

Spark SFTP library can not download the file from sftp server when running in EMR

Unable to read Hbase data with spark in yarn cluster mode

Unable to create RDD from data in HDFS

Categories

Resources