I am trying to create an RDD with the code below but am unable to do it. Is there a solution to this issue?
I have tried running it with the localhost:port details. I have also tried running it with the full HDFS path, /user/training/intel/NYSE.csv. Whichever path I use is searched only on the local filesystem, not on HDFS.
Thanks
scala> val myrdd = sc.textFile("/training/intel/NYSE.csv")
myrdd: org.apache.spark.rdd.RDD[String] = /training/intel/NYSE.csv MapPartitionsRDD[5] at textFile at <console>:24
scala> myrdd.collect
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/training/intel/NYSE.csv
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
... 48 elided
I also tried the following:
scala> val myrdd = sc.textFile("hdfs://localhost:8020/training/intel/NYSE.csv")
myrdd: org.apache.spark.rdd.RDD[String] = hdfs://localhost:8020/training/intel/NYSE.csv MapPartitionsRDD[7] at textFile at <console>:24
scala> myrdd.collect
java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).; Host Details : local host is: "hadoop/127.0.0.1"; destination host is: "localhost":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy24.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2108)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:57)
at org.apache.hadoop.fs.Globber.glob(Globber.java:252)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1674)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:259)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
... 48 elided
Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).
at com.google.protobuf.InvalidProtocolBufferException.invalidTag(InvalidProtocolBufferException.java:89)
at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:108)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.<init>(RpcHeaderProtos.java:2201)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.<init>(RpcHeaderProtos.java:2165)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:2295)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto$1.parsePartialFrom(RpcHeaderProtos.java:2290)
at com.google.protobuf.AbstractParser.parsePartialFrom(AbstractParser.java:200)
at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:241)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
at org.apache.hadoop.ipc.protobuf.RpcHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcHeaderProtos.java:3167)
at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1086)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:979)
No matter how I run it, I get an error saying the path doesn't exist.
The files are in HDFS.
Spark has been configured to read from your local filesystem (file:/).
You need to edit the core-site.xml file in your Spark installation directory to make sure that fs.defaultFS is set correctly to point at your Hadoop NameNode.
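For example, you can check what Spark currently treats as the default filesystem from the spark-shell, and work around it with a fully qualified URI while the config is being fixed (a sketch; replace namenode-host:8020 with your actual NameNode address):

// Prints the default filesystem from Spark's Hadoop configuration.
// If this prints file:/// (or null), bare paths are resolved against the local disk.
println(sc.hadoopConfiguration.get("fs.defaultFS"))

// Workaround until core-site.xml is fixed: qualify the path explicitly.
val myrdd = sc.textFile("hdfs://namenode-host:8020/user/training/intel/NYSE.csv")
println(myrdd.count())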
InvalidProtocolBufferException: Protocol message contained an invalid tag (zero).; Host Details : local host is: "hadoop/127.0.0.1"; destination host is: "localhost":8020;
This would imply that your Spark HDFS client is not compatible with the installed Hadoop server APIs, or that you're connecting to the wrong port.
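One quick check on the client side (a sketch; compare the output with hadoop version on the NameNode host to rule out the incompatibility mentioned above):

// Prints the Hadoop client version bundled with this Spark installation.
println(org.apache.hadoop.util.VersionInfo.getVersion)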
Also, the file is not at hdfs:///training/.. anyway.
Besides that, HDFS isn't required to learn Spark, so maybe try playing with hadoop fs commands first, or move the files to the local filesystem, depending on what your goals are.
This happens due to the internal mapping between directories. First, go to the directory where your file (NYSE.csv) is kept and run the command:
df -k
You will get the actual mount point of the directory. For example: /xyz
Now try finding your file (NYSE.csv) within this mount point, for example /xyz/training/intel/NYSE.csv, and use this path in your code:
val myrdd = sc.textFile("/xyz/training/intel/NYSE.csv")
"/training/intel/NYSE.csv"
means "start looking from the top level directory". Either change it to "/user/training/intel/NYSE.csv" or just "training/intel/NYSE.csv" (no leading /) to reference a file relative to your current directory.
I am trying to get data from Cassandra as DataFrames using the spark-cassandra-connector, but I'm getting the exception below.
Note: the connection to Cassandra is successful.
Spark version: 2.4.1
spark-cassandra-connector version: 2.5.1
Error starting ApplicationContext. To display the conditions report re-run your application with 'debug' enabled.
2021-10-01 11:32:01.649 ERROR 17404 --- [ main] o.s.boot.SpringApplication : Application run failed
java.lang.InstantiationError: com.datastax.oss.driver.internal.core.util.collection.QueryPlan
at com.datastax.spark.connector.cql.LocalNodeFirstLoadBalancingPolicy.newQueryPlan(LocalNodeFirstLoadBalancingPolicy.scala:122) ~[spark-cassandra-connector-driver_2.11-2.5.1.jar:2.5.1]
at com.datastax.oss.driver.internal.core.metadata.LoadBalancingPolicyWrapper.newQueryPlan(LoadBalancingPolicyWrapper.java:155) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.onThrottleReady(CqlRequestHandler.java:193) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.session.throttling.PassThroughRequestThrottler.register(PassThroughRequestThrottler.java:52) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.(CqlRequestHandler.java:171) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.cql.CqlRequestAsyncProcessor.process(CqlRequestAsyncProcessor.java:44) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:54) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.cql.CqlRequestSyncProcessor.process(CqlRequestSyncProcessor.java:30) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.internal.core.session.DefaultSession.execute(DefaultSession.java:230) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:54) ~[java-driver-core-shaded-4.11.3.jar:na]
at com.datastax.oss.driver.api.core.cql.SyncCqlSession.execute(SyncCqlSession.java:78) ~[java-driver-core-shaded-4.11.3.jar:na]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_271]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_271]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_271]
at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_271]
at com.datastax.spark.connector.cql.SessionProxy.invoke(SessionProxy.scala:43) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.sun.proxy.$Proxy81.execute(Unknown Source) ~[na:na]
at com.datastax.spark.connector.rdd.partitioner.dht.TokenFactory$$anonfun$1.apply(TokenFactory.scala:99) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.partitioner.dht.TokenFactory$$anonfun$1.apply(TokenFactory.scala:98) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:129) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.partitioner.dht.TokenFactory$.forSystemLocalPartitioner(TokenFactory.scala:98) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.partitioner.SplitSizeEstimator$class.tokenFactory(SplitSizeEstimator.scala:9) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tokenFactory$lzycompute(CassandraTableScanRDD.scala:64) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.tokenFactory(CassandraTableScanRDD.scala:64) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.partitioner.SplitSizeEstimator$class.estimateDataSize(SplitSizeEstimator.scala:12) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.partitioner.SplitSizeEstimator$class.estimateSplitCount(SplitSizeEstimator.scala:21) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.estimateSplitCount(CassandraTableScanRDD.scala:64) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$1.apply$mcI$sp(CassandraTableScanRDD.scala:228) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$1.apply(CassandraTableScanRDD.scala:228) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD$$anonfun$1.apply(CassandraTableScanRDD.scala:228) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at scala.Option.getOrElse(Option.scala:121) ~[scala-library-2.11.12.jar:na]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.partitionGenerator$lzycompute(CassandraTableScanRDD.scala:228) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.partitionGenerator(CassandraTableScanRDD.scala:224) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at com.datastax.spark.connector.rdd.CassandraTableScanRDD.getPartitions(CassandraTableScanRDD.scala:273) ~[spark-cassandra-connector_2.11-2.5.1.jar:2.5.1]
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253) ~[spark-core_2.11-2.4.1.jar:2.4.1]
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251) ~[spark-core_2.11-2.4.1.jar:2.4.1]
at scala.Option.getOrElse(Option.scala:121) ~[scala-library-2.11.12.jar:na]
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251) ~[spark-core_2.11-2.4.1.jar:2.4.1]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126) ~[spark-core_2.11-2.4.1.jar:2.4.1]
at org.apache.spark.rdd.RDD.count(RDD.scala:1168) ~[spark-core_2.11-2.4.1.jar:2.4.1]
at org.apache.spark.api.java.JavaRDDLike$class.count(JavaRDDLike.scala:455) ~[spark-core_2.11-2.4.1.jar:2.4.1]
at org.apache.spark.api.java.AbstractJavaRDDLike.count(JavaRDDLike.scala:45) ~[spark-core_2.11-2.4.1.jar:2.4.1]
The error you posted indicates that the embedded Java driver is not able to generate a query plan (the list of Cassandra nodes to connect to as coordinators). There is possibly an issue with how you've defined the contact points.
You normally need to specify a contact point with the spark.cassandra.connection.host parameter. Here's an example of how you would start a Spark shell using the connector:
$ spark-shell \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.5.1 \
  --conf spark.cassandra.connection.host=cassandra_ip \
  --conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions
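Once the shell is up with those settings, a minimal DataFrame read through the connector looks like this (keyspace and table names here are placeholders):

// Reads a Cassandra table via the Spark Cassandra Connector's DataFrame API.
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table")) // hypothetical names
  .load()

df.show(5)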
In your case, it looks like you're creating a connection from Spring Boot and you are probably running into conflicts with dependencies.
You will need to update your original question with details of your configuration, including the dependencies and the command you're running to connect to Spark, so that those answering your question have a better idea of what the problem is. Cheers!
I am using com.springml.spark.sftp in my Spark job to download a file from an SFTP server. The basic code is as follows:
val sftpDF = spark.read.
schema(my_schema).
format("com.springml.spark.sftp").
option("host", "myhost.test.com").
option("username", "myusername").
option("password", "mypassword").
option("inferSchema", "false").
option("fileType", "csv").
option("delimiter", ",").
option("codec", "org.apache.hadoop.io.compress.GzipCodec").
load("/data/test.csv.gz")
It runs well when I run it on my local machine using "spark-submit spark.jar". However, when I tried to run it on EMR, it showed the following errors. It seems the Spark job tried to find the file in HDFS instead of on the SFTP server.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://ip-10-61-82-166.ap-southeast-2.compute.internal:8020/tmp/test.csv.gz;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:545)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:359)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at com.springml.spark.sftp.DatasetRelation.read(DatasetRelation.scala:44)
at com.springml.spark.sftp.DatasetRelation.<init>(DatasetRelation.scala:29)
at com.springml.spark.sftp.DefaultSource.createRelation(DefaultSource.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:316)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at com.example.App$.main(App.scala:134)
at com.example.App.main(App.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
What could be wrong? Do I need to register a DataSource for the SFTP module?
Thanks!
Okay, I found the reason. Version 1.1.0 of this SFTP library was being used. It downloads the file to a folder on the driver instead of on the executor, which is why the error appeared when I ran the job in cluster mode rather than standalone mode. I also noticed the same issue reported here: https://github.com/springml/spark-sftp/issues/24. After upgrading the SFTP library from 1.1.0 to 1.1.3, the problem was solved.
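For reference, the corresponding dependency bump in an sbt build would look roughly like this (coordinates quoted from memory; verify the exact artifact name on Maven Central before relying on it):

// build.sbt: upgrade spark-sftp to 1.1.3 (assuming a Scala 2.11 build)
libraryDependencies += "com.springml" %% "spark-sftp" % "1.1.3"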
Cluster configuration:
Hadoop: CDH-6.2.1
Spark: 2.4.0
Hbase: 2.0
What I do: Read HBase data through Spark
When I use IntelliJ and local mode everything works fine, but when I switch to
spark-submit --master yarn, the following stack trace appears:
20/05/20 11:00:46 ERROR mapreduce.TableInputFormat: java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:221)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:114)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.initialize(TableInputFormat.java:200)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:243)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:254)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2146)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at com.song.HbaseOnSpark1$.main(HbaseOnSpark1.scala:32)
at com.song.HbaseOnSpark1.main(HbaseOnSpark1.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:673)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:219)
... 27 more
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.client.ConnectionImplementation.close(ConnectionImplementation.java:1938)
at org.apache.hadoop.hbase.client.ConnectionImplementation.<init>(ConnectionImplementation.java:310)
... 32 more
20/05/20 11:00:46 ERROR yarn.ApplicationMaster: User class threw exception: java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:254)
at org.apache.hadoop.hbase.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:254)
at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2146)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at com.song.HbaseOnSpark1$.main(HbaseOnSpark1.scala:32)
at com.song.HbaseOnSpark1.main(HbaseOnSpark1.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:673)
Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:558)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:249)
... 24 more
This is my code:
val conf: SparkConf = new SparkConf().setAppName("spark1")
val spark = new SparkContext(conf)
val hbaseConf: Configuration = HBaseConfiguration.create()
hbaseConf.set("hbase.zookeeper.quorum","hadoop01,hadoop02,hadoop03")
hbaseConf.set(TableInputFormat.INPUT_TABLE,"idx_name")
hbaseConf.set("hbase.defaults.for.version.skip", "true")
val rdd: RDD[(ImmutableBytesWritable, Result)] = spark.newAPIHadoopRDD(
hbaseConf,
classOf[TableInputFormat],
classOf[ImmutableBytesWritable],
classOf[Result]
)
It's an HBase classpath issue in your cluster; you need to add the HBase jars to your classpath like this:
export SPARK_CLASSPATH=$SPARK_CLASSPATH:`hbase classpath`
hbase classpath will list all the jars needed for HBase connections, etc.
Why is it working in local mode?
Because all the required jars are already on the IDE's library path.
If you are using Maven, run mvn dependency:tree to understand which jars are needed in the cluster; based on that you can adjust your spark-submit script.
If you are using the --jars option, check that all jars are passed correctly, or that your uber jar packages the correct dependencies.
There might also be a jar conflict; check that carefully against your local-mode environment, since that is working fine.
Further reading: Spark spark-submit --jars arguments wants comma list, how to declare a directory of jars?
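As a quick sanity check (a sketch, not specific to your job), you can launch a spark-shell with the same --master yarn settings and confirm whether the HBase client classes are actually visible on the driver:

import scala.util.Try

// True only if the HBase client/mapreduce jars made it onto the classpath.
Seq(
  "org.apache.hadoop.hbase.client.ConnectionFactory",
  "org.apache.hadoop.hbase.mapreduce.TableInputFormat"
).foreach { cls =>
  println(s"$cls loadable: ${Try(Class.forName(cls)).isSuccess}")
}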
I have a Spark EC2 cluster where I am submitting a PySpark program from a Zeppelin notebook. I have loaded hadoop-aws-2.7.3.jar and aws-java-sdk-1.11.179.jar and placed them in the /opt/spark/jars directory of the Spark instances. I get a java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException.
Why is Spark not seeing the jars? Do I have to put the jars on all the slaves and specify a spark-defaults.conf for the master and the slaves? Is there something that needs to be configured in Zeppelin to recognize the new jar files?
I have placed the jar files in /opt/spark/jars on the Spark master. I have created a spark-defaults.conf and added the lines:
spark.hadoop.fs.s3a.access.key [ACCESS KEY]
spark.hadoop.fs.s3a.secret.key [SECRET KEY]
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.driver.extraClassPath /opt/spark/jars/hadoop-aws-2.7.3.jar:/opt/spark/jars/aws-java-sdk-1.11.179.jar
The Zeppelin interpreter sends a spark-submit to the Spark master.
I have also placed the jars in /opt/spark/jars on the slaves, but did not create a spark-defaults.conf there.
%spark.pyspark
#importing necessary libraries
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StringType
from pyspark import SQLContext
from itertools import islice
from pyspark.sql.functions import col
# add aws credentials
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "[ACCESS KEY]")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "[SECRET KEY]")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
#creating the context
sqlContext = SQLContext(sc)
#reading the first csv file and store it in an RDD
rdd1= sc.textFile("s3a://filepath/baby-names.csv").map(lambda line: line.split(","))
#removing the first row as it contains the header
rdd1 = rdd1.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it
)
#converting the RDD into a dataframe
df1 = rdd1.toDF(['year','name', 'percent', 'sex'])
#print the dataframe
df1.show()
Error thrown:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 10.11.93.90, executor 1): java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonServiceException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 34 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:153)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonServiceException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 34 more
I was able to address the above by making sure I had the correct version of the hadoop-aws jar for the Hadoop version of the Spark build I was running, downloading the matching version of the aws-java-sdk, and lastly downloading the jets3t dependency.
In /opt/spark/jars:
sudo wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.11.30/aws-java-sdk-1.11.30.jar
sudo wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar
sudo wget https://repo1.maven.org/maven2/net/java/dev/jets3t/jets3t/0.9.4/jets3t-0.9.4.jar
Testing it out
scala> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", [ACCESS KEY ID])
scala> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", [SECRET ACCESS KEY] )
scala> val myRDD = sc.textFile("s3n://adp-px/baby-names.csv")
scala> myRDD.count()
res2: Long = 49
If S3 access is via assume_role from a local cluster, then the following worked for me.
import boto3
import pyspark as pyspark
from pyspark import SparkContext
session = boto3.session.Session(profile_name='profile_name')
sts_connection = session.client('sts')
response = sts_connection.assume_role(RoleArn='arn:aws:iam:::role/role_name', RoleSessionName='role_name',DurationSeconds=3600)
credentials = response['Credentials']
conf = pyspark.SparkConf()
conf.set('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:3.2.0')  # cross-check the version
sc = SparkContext(conf=conf)
sc._jsc.hadoopConfiguration().set('fs.s3a.aws.credentials.provider', 'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', credentials['AccessKeyId'])
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', credentials['SecretAccessKey'])
sc._jsc.hadoopConfiguration().set('fs.s3a.session.token', credentials['SessionToken'])
url = str('s3a://data.csv')
l1 = sc.textFile(url).collect()
for each in l1:
    print(str(each))
    break
Also keep the proper versions of the following jars in $SPARK_HOME/jars:
jets3t
aws-java-sdk
hadoop-aws
I prefer to delete unwanted jars from ~/.ivy2/jars
From the official Hadoop troubleshooting documentation:
ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem
These are Hadoop filesystem client classes, found in the `hadoop-aws`
JAR. An exception reporting this class as missing means that this JAR
is not on the classpath.
To solve this problem you first need to know what org.apache.hadoop.fs.s3a is:
The Hadoop website explains in detail what the Hadoop-AWS module (Integration with Amazon Web Services) is. The prerequisite for using it is having these two jars installed under the Spark jars directory:
hadoop-aws Jar
aws-java-sdk-bundle Jar
When downloading these jars, make sure of two things (see the sketch after this list):
The Hadoop version matches the hadoop-aws version; a hadoop-aws-3.x.x jar works with a Hadoop 3.x.x installation.
The AWS SDK for Java version matches the Java version installed. Check the official AWS documentation for the exact version requirements.
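Once both jars are in place and the versions line up, a minimal s3a read from a Scala spark-shell looks like this (a sketch: the bucket and key are placeholders, and credentials are assumed to be exported in the environment):

// With matching hadoop-aws and aws-java-sdk-bundle jars on the classpath,
// no fs.s3a.impl setting is needed; just supply credentials and read.
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val rdd = sc.textFile("s3a://my-bucket/path/to/file.csv") // hypothetical bucket/key
println(rdd.count())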
For more troubleshooting, you can always refer to the official Hadoop troubleshooting documentation.
The following worked for me.
My system config:
Ubuntu 16.04.6 LTS
python3.7.7
openjdk version 1.8.0_252
spark-2.4.5-bin-hadoop2.7
1. Configure the PYSPARK_PYTHON path:
Add the following line to $SPARK_HOME/conf/spark-env.sh:
export PYSPARK_PYTHON=python_env_path/bin/python
2. Start pyspark:
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.11.760,org.apache.hadoop:hadoop-aws:2.7.0 --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-2.amazonaws.com
com.amazonaws:aws-java-sdk-pom:1.11.760: depends on your JDK version
org.apache.hadoop:hadoop-aws:2.7.0: depends on your Hadoop version
s3.us-west-2.amazonaws.com: depends on your S3 region
3. Read data from S3:
df2=spark.read.parquet("s3a://s3location_file_path")
Credits
Each Hadoop version should be matched with the corresponding aws-java-sdk-...jar and hadoop-aws-...jar.
And each aws-java-sdk version is paired with a specific hadoop-aws-...jar (the version numbers are not the same).
For example, aws-java-sdk-bundle-1.11.375.jar and hadoop-aws-3.2.0.jar are a matching pair.
Lastly, you should register the S3 domain in the hive.cnf configuration file.
If nothing above works, then grep the jars for the missing class; there is a high possibility that the jar is corrupted.
For example, if you get class AmazonServiceException not found, run a grep in the directory where the jar is already present, as shown below:
grep "AmazonServiceException" *.jar
Add the following to the file hadoop/etc/hadoop/core-site.xml:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>***</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>***</value>
</property>
Inside the Hadoop installation directory, find the AWS jars; on macOS the installation directory is /usr/local/Cellar/hadoop/:
find . -type f -name "*aws*"
sudo cp hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar hadoop/share/hadoop/common/lib/
sudo cp hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.5.jar hadoop/share/hadoop/common/lib/
Credit
I am running a simple Spark app to load a file from S3 into an RDD and convert it into a PySpark DataFrame:
data=sc.textFile('s3a://bigdata-plat/churnData/transaction.csv')
df=data.toDF()
I also tried:
data=sc.textFile('s3a://bigdata-plat/churnData/transaction.csv')
df = data.map(lambda x: Row(**f(x))).toDF()
but it gives the same error:
java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2701)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2683)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:372)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:253)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:251)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:61)
at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:745)
I am setting up the Spark context as:
pyspark.SparkConf().setAll([('spark.eventLog.dir', '/spark/logs/tmp/')
,("spark.driver.extraClassPath","path/hadoop-common-2.7.7.jar:/path/aws-java-sdk-1.10.6.jar:path/hadoop-aws-2.7.7.jar")
,("spark.hadoop.fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
,("fs.s3a.access.key", AWS_ACCESS_KEY)
,("fs.s3a.secret.key", AWS_SECRET_KEY)])
I am using Spark 2.4 and Hadoop 2.7.7.
aws-java-sdk versions tried: 1.11.440, 1.11.75, 1.10.6, 1.7.4
I don't understand: is this a dependency issue?
Or am I missing additional jar files that are needed?
Any solution?
The AWS SDKs are pretty brittle. You need to use the exact version of the AWS SDK the hadoop-aws connector was built with, otherwise things either don't link properly or fail in various ways.
For the files you need, see:
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.7
PS: there is no need to set spark.hadoop.fs.s3a.impl; that binding is automatic.
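One way to see which SDK jar actually wins on the classpath is to ask the JVM where it loaded the offending class from (a sketch to run from a Scala spark-shell on the same machine; if the jar printed is not the SDK version that hadoop-aws 2.7.7 was built against, that explains the NoSuchMethodError):

// Prints the jar the TransferManager class was loaded from; a NoSuchMethodError
// usually means this jar is a different AWS SDK version than hadoop-aws expects.
val location = Class.forName("com.amazonaws.services.s3.transfer.TransferManager")
  .getProtectionDomain.getCodeSource.getLocation
println(location)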