Unable to access S3 data using Spark 2.2

I get a lot of data uploaded to an S3 bucket that I want to analyze/visualize using Spark and Zeppelin. Yet I am still stuck at loading data from S3.
I did some reading to piece this together, and I will spare you the gory details. I am using the docker container p7hb/docker-spark as my Spark installation, and my basic test for reading data from S3 is derived from here.
I start the container plus a master and a slave process within it. I can validate that this works by looking at the Spark Master WebUI, exposed on port 8080. This page does list the worker and keeps a log of all my failed attempts under the headline "Completed Applications". All of those are in the state FINISHED.
I open a bash inside that container and do the following:
a) export the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, as suggested here.
b) start spark-shell. In order to access S3 one seems to need to load some extra packages. Browsing through SE I found especially this, which teaches me that I can use the --packages parameter to load said packages. Essentially I run spark-shell --packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5 (or arbitrary combinations of versions).
c) I run the following code
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-eu-central-1.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
val sonnets = sc.textFile("s3a://my-bucket/my.file")
val counts = sonnets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
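Strictly speaking, both lines are lazy transformations, so the errors below only surface once an action forces the actual S3 read, for example:
counts.take(10).foreach(println) // any action (collect, count, ...) triggers the read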
And then I get all kinds of different error messages, depending on the versions I choose in b).
I suppose there is nothing wrong with a), because I get the error message Unable to load AWS credentials from any provider in the chain if I don't supply the credentials. This is a known error new users seem to make.
While trying to solve the issue, I pick more or less random versions from here and there for the two extra packages. Somewhere on SE I read that hadoop-aws:2.7 is supposed to be the right choice, because Spark 2.2 is based on Hadoop 2.7. Supposedly one needs to use aws-java-sdk:1.7 with that version of hadoop-aws.
Whatever! I tried the following combinations:
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1, which yields the common Bad Request 400 error.
Many problems can lead to that error; my attempt as described above contains everything I was able to find on this page. The description above uses s3-eu-central-1.amazonaws.com as the endpoint, while other places use s3.eu-central-1.amazonaws.com. According to the documentation, both endpoint names are supposed to work. I did try both.
--packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5, which are the most recent micro versions in either case; with these I get the error message java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.7.5, I also get java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.1, I get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.3, I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.8.12,org.apache.hadoop:hadoop-aws:2.8.3, I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.9.0, I also get java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
And, for completeness' sake, when I don't provide the --packages parameter at all, I get java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
Currently nothing seems to work. Yet there are so many Q&As on this topic that who knows what the way du jour of doing this is. This is all in local mode, so there is virtually no other source of error. My method of accessing S3 must be wrong. How is it done correctly?
Edit 1:
So I put another day into this, without any actual progress. As far as I can tell, starting from Hadoop 2.6, Hadoop no longer has built-in support for S3; it has to be loaded through additional libraries, which are not part of Hadoop proper and are versioned separately. Besides all the clutter, the library I ultimately want seems to be hadoop-aws. It has a webpage here, and it carries what I would call authoritative information:
The versions of hadoop-common and hadoop-aws must be identical.
The important thing about this information is that hadoop-common actually does ship with a Hadoop installation. Every Hadoop installation has a corresponding JAR file, so this is a solid starting point. My containers have a file /usr/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar, so it is fair to assume 2.7.3 is the version I need for hadoop-aws.
After that it gets murky. Hadoop versions 2.7.x have something going on internally that makes them incompatible with more recent versions of aws-java-sdk, the library required by hadoop-aws. The Internet is full of advice to use version 1.7.4, for example here, but other comments suggest using version 1.7.14 with 2.7.x.
So I did another run using hadoop-aws:2.7.3 and aws-java-sdk:1.7.x, with x ranging from 4 to 14. No results whatsoever; I always end up with error 400, Bad Request.
My Hadoop installation ships joda-time 2.9.4. I read that the problem was resolved with Hadoop 2.8, so I suppose I will just go ahead and build my own docker containers with more recent versions.
Edit 2:
Moved to Hadoop 2.8.3. It just works now. Turns out you don't even have to mess around with downloaded JARs at all: Hadoop ships with what are supposed to be working JARs for accessing AWS S3. They are hidden in ${HADOOP_HOME}/share/hadoop/tools/lib and not added to the classpath by default. I simply load the JARs in that directory, execute my code as stated above, and now it works.
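For anyone reproducing this, one way to pick those JARs up (a sketch based on my container; adjust ${HADOOP_HOME} to your installation) is to put that directory on both classpaths when starting the shell:
spark-shell --driver-class-path "${HADOOP_HOME}/share/hadoop/tools/lib/*" --conf spark.executor.extraClassPath="${HADOOP_HOME}/share/hadoop/tools/lib/*"
after which the s3a:// code above runs unchanged, with no --packages parameter needed.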

Mixing and matching AWS SDK JARs with anything else is an exercise in futility, as you've discovered. You need the version of the AWS JARs Hadoop was built with, and the version of Jackson the AWS SDK was built with. Oh, and don't try mixing any of these: different amazon-* JARs, different hadoop-* JARs, different jackson-* JARs. They all move in lock-step.
For Spark 2.2.0 and Hadoop 2.7, use the AWS 1.7.4 artifacts, and make sure that, if you are on Java 8, Joda Time is > 2.8.0, such as 2.9.4; an older Joda Time can lead to 400 "bad auth" problems.
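As a sketch, that amounts to a launch line along these lines (coordinates for illustration only; check what your Hadoop build actually depends on):
spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3,joda-time:joda-time:2.9.4
with aws-java-sdk 1.7.4 pulled in transitively by hadoop-aws 2.7.x.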
Otherwise, try the Troubleshooting S3A documentation.

Related

java.lang.NumberFormatException: in Pyspark when writing to S3

I am trying to read a compressed log file from an S3 bucket using pyspark on an EC2 instance.
The EC2 instance has read permission to the S3 bucket, as I am able to manually download the file using the AWS CLI.
This is what my code looks like:
file_path= 's3a://<bucket_name>/<path_of_file>'
rdd1 = sc.textFile(file_path)
rdd1.take(3)
But I am getting the error below:
py4j.protocol.Py4JJavaError: An error occurred while calling o36.partitions.
: java.lang.NumberFormatException: For input string: "64M"
Can somebody help me out?
You are mixing versions of hadoop-common with an older version of hadoop-aws.
The s3a connector added support for using a unit when declaring the multipart block size back in 2016, in https://issues.apache.org/jira/browse/HADOOP-13680.
hadoop-common JAR versions 2.8+ set the default to "64M".
If the version of the s3a connector you are using can't cope with that, it is the better part of a decade old.
Please:
upgrade your hadoop-* JARs to a recent version, ideally 3.3.0+
make sure they are all the same version, unless you enjoy seeing stack traces
and use the exact same aws-sdk-bundle JAR which Hadoop was built with, unless you want to see different stack traces.
This is not an opinion; these are instructions from the hadoop-aws maintenance team.
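If you truly cannot upgrade right away, one stopgap (my suggestion here, not part of the maintainers' instructions above) is to set the multipart size explicitly to a plain byte count that an old connector can still parse, for example:
spark-submit --conf spark.hadoop.fs.s3a.multipart.size=104857600 your_job.py
(your_job.py standing in for your script; the spark.hadoop. prefix copies the key into the Hadoop configuration). The real fix remains the upgrade.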

stop hive's RetryingHMSHandler logging to databricks cluster

I'm using Azure Databricks 5.5 LTS with Spark 2.4.3 and Scala 2.11. Almost every request going to the Databricks cluster comes up with the following error log:
ERROR RetryingHMSHandler: NoSuchObjectException(message:There is no database named global_temp)
at org.apache.hadoop.hive.metastore.ObjectStore.getMDatabase(ObjectStore.java:487)
at org.apache.hadoop.hive.metastore.ObjectStore.getDatabase(ObjectStore.java:498)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
While this isn't affecting the end result of what we're trying to do, our logs are constantly getting filled with it, and that isn't very pleasant to go through. I've tried turning it off by setting the following property on the driver and executor:
log4j.level.org.apache.hadoop.hive.metastore.RetryingHMSHandler=OFF
only to realize, later on, that the RetryingHMSHandler class actually uses an slf4j logger. Is there an elegant way to overcome this?
Maybe late, but I faced the same issue with a Databricks cluster on 9.1 LTS (Apache Spark 3.1.2, Scala 2.12). I solved it by using an init script that adds the following two properties
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL, publicFile
log4j.additivity.org.apache.hadoop.hive.metastore.RetryingHMSHandler=false
to the driver's log4j.properties.
My goal was to remove all verbose logs from the "log4j-active.log" file that can be downloaded from a job UI. Following https://learn.microsoft.com/en-us/azure/databricks/kb/clusters/overwrite-log4j-logs, I decided to add/overwrite some property values within the driver's log4j.properties (first I had a look at its content, of course).
Having added those two properties, I was able to silence RetryingHMSHandler as well (the only third-party log call that was still surviving).
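For completeness, a sketch of what such an init script can look like (the log4j.properties path is the one from the linked article; verify it against your runtime version):
#!/bin/bash
# Append the two logger properties to the driver's log4j.properties
LOG4J=/databricks/spark/dbconf/log4j/driver/log4j.properties
echo "log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL, publicFile" >> "$LOG4J"
echo "log4j.additivity.org.apache.hadoop.hive.metastore.RetryingHMSHandler=false" >> "$LOG4J"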
Hope it helps ;)

spark LOCAL and alluxio client

I'm running spark in LOCAL mode and trying to get it to talk to alluxio. I'm getting the error:
java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found
I have looked at the page here:
https://www.alluxio.org/docs/master/en/Debugging-Guide.html#q-why-do-i-see-exceptions-like-javalangruntimeexception-javalangclassnotfoundexception-class-alluxiohadoopfilesystem-not-found
Which details the steps to take in this situation, but I'm not finding success.
According to the Spark documentation, I can instantiate a local Spark like so:
SparkSession.builder
.appName("App")
.getOrCreate
Then I can add the alluxio client library like so:
sparkSession.conf.set("spark.driver.extraClassPath", ALLUXIO_SPARK_CLIENT)
sparkSession.conf.set("spark.executor.extraClassPath", ALLUXIO_SPARK_CLIENT)
I have verified that the proper jar file exists in the right location on my local machine with:
logger.error(sparkSession.conf.get("spark.driver.extraClassPath"))
logger.error(sparkSession.conf.get("spark.executor.extraClassPath"))
But I still get the error. Is there anything else I can do to figure out why Spark is not picking the library up?
Please note I am not using spark-submit - I am aware of the methods for adding the client jar to a spark-submit job. My Spark instance is being created as local within my application and this is the use case I want to solve.
As an FYI there is another application in the cluster which is connecting to my alluxio using the fs client and that all works fine. In that case, though, the fs client is being packaged as part of the application through standard sbt dependencies.
Thanks
In the hopes that this helps someone else:
My problem here was not that the library wasn't getting loaded or wasn't on the classpath; it was that I was using the "fs" version of the client rather than the "hdfs" version.
I had been using the generic 1.4 client; at some point this client was split into an "fs" version and an "hdfs" version. When I updated to 1.7 recently, I mistakenly added the "fs" version.
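In sbt terms (version number illustrative; artifact names as I understand the split), the fix boils down to depending on
libraryDependencies += "org.alluxio" % "alluxio-core-client-hdfs" % "1.7.1"
rather than the alluxio-core-client-fs artifact.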

How to install a postgresql JDBC driver in pyspark

I use pyspark with Spark 2.2.0 on Lubuntu 16.04 and I want to write a DataFrame to my PostgreSQL database. Now, as far as I understand it, I have to install a JDBC driver on the Spark master for this. I downloaded the PostgreSQL JDBC driver from their website and tried to follow this post. I added spark.jars.packages /path/to/driver/postgresql-42.2.1.jar to spark-defaults.conf, with the only result that pyspark no longer launches.
I'm kinda lost in Java land. For one, I don't know if this is the right format. The documentation tells me I should add a list, but I don't know what a path list is supposed to look like. Then I don't know whether I also have to specify spark.jars and/or spark.driver.extraClassPath, or whether spark.jars.packages is enough. And if I have to add them, what format do they take?
spark.jars.packages is for dependencies that can be pulled from Maven (think of it as pip for Java, although the analogy is probably kinda loose).
You can submit your job with the option --jars /path/to/driver/postgresql-42.2.1.jar, so that the submission will also provide the library, which the cluster manager will distribute to all worker nodes on your behalf.
If you want to set this as a configuration, you can use the spark.jars key instead of spark.jars.packages. The latter requires Maven coordinates rather than a path (which is probably the reason why your job is failing).
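Concretely, either of these lines in spark-defaults.conf should work (the file path is the one from your question; the Maven coordinates are what I'd expect for that driver version, so double-check them):
spark.jars /path/to/driver/postgresql-42.2.1.jar
spark.jars.packages org.postgresql:postgresql:42.2.1
but a file path under spark.jars.packages will not.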
You can read more about the configuration keys I introduced on the official documentation.

excluding hadoop from spark build

I'm modifying the hdfs module inside Hadoop, and would like to see the change reflected while I'm running Spark on top of it, but I still see the native Hadoop behaviour. I've checked and saw that Spark builds a really fat JAR file, which contains all the Hadoop classes (using the hadoop profile defined in Maven), and deploys it over all workers. I also tried bigtop-dist, to exclude Hadoop classes, but saw no effect.
Is it possible to do such a thing easily, for example by small modifications inside the Maven file?
I believe you are looking for the provided scope on Maven artifacts. It allows you to exclude certain classes from packaging while still letting you compile against them (with the expectation that your runtime environment will provide them at their correct respective versions). See here and here for further discussion.
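As a sketch, marking the Hadoop dependency as provided in the POM looks like this (artifact and version for illustration; use whichever hadoop-* artifacts your build actually references):
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.7.3</version>
  <scope>provided</scope>
</dependency>
Spark's own Maven build also ships a hadoop-provided profile (-Phadoop-provided) that leaves the Hadoop classes out of the packaged distribution, so the JARs from your modified installation are the ones picked up at runtime.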
