I get a lot of data uploaded to an S3 bucket that I want so analyze/visualize using Spark and Zeppelin. Yet, I am still stuck at loading data from S3.
I did some reading in order to get this together and spare me gory details. I am using the docker container p7hb/docker-spark as Spark installation and my basic test for reading data from S3 is derived from here:
I start the container and a master and a slave process within. I can validate this works by looking at the Spark Master WebUI, exposed on port 8080. This page does list the worker and keeps a log of all my failed attempts under the headline "Completed Applications". All of those are in the state FINISHED.
I open a bash inside that container and do the following:
a) export the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, as suggested here.
b) start spark-shell. In order to access S3 one seems to need to load some extra packages. Browsing through SE I found especially this, which teaches me, that I can use the --packages parameter to load said packages. Essentially I run spark-shell --packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5(, for arbitrary combinations of versions).
c) I run the following code
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-eu-central-1.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
val sonnets=sc.textFile("s3a://my-bucket/my.file")
val counts = sonnets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
And then I get all kinds of different Error messages, depending on the versions I choose in 2b).
I suppose there is nothing wrong with 2a), b/c I get the error message Unable to load AWS credentials from any provider in the chain if I don't supply those. This is a known error new users seem to make.
While trying to solve the issue, I pick more or less random versions from here and there for the two extra packages. Somewhere on SE I read that hadoop-aws:2.7 is supposed to be the right choice, because Spark 2.2 is based on Hadoop 2.7. Supposedly one needs to use aws-java-sdk:1.7 with that version of hadoop-aws.
Whatever! I tried thefollowing combinations
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1, which yields the common Bad Request 400 error.
Many problems can lead to that error, my attempt as described above containseverything I was able to find on this page. The description above contains s3-eu-central-1.amazonaws.com as endpoint, while other places use s3.eu-central-1.amazonaws.com. According to enter link description here, both endpoint names are supposed to work. I did try both.
--packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5, which are the most recent micro versions in either case, I get the error message java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecuto
r;)V
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.7.5, I also get java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.1, I get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.3, I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.8.12,org.apache.hadoop:hadoop-aws:2.8.3, I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.9.0, I also get java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
And, for completeness sake, when I don't provide the --packages parameter, I get java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
Currently nothing seems to work. Yet, there are so many Q/As on this topic, who knows what's the way du jour of doing this. This is all in local mode, so there is virtually no other source of error. My method of accessing S3 must be wrong. How is it done correctly?
Edit 1:
So I put another day into this, without any actual progress. As far as I can tell, starting from Hadoop 2.6, Hadoop doesn't have built in support for S3 anymore, but it as to be loaded through additional libraries, which are not part of Hadoop and entirely managed by itself. Besides all the clutter, the library I ultimately want seems to be hadoop-aws. It has a webpage here andit carries what I would call authoritative information:
The versions of hadoop-common and hadoop-aws must be identical.
The important thing about this information is, that hadoop-common actually does ship with a Hadoop installation. Every Hadoop installation has a corresponding jar file, so this is a solid starting point. My containers have a file /usr/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar so it is fair to assume 2.7.3 is the version I need for hadoop-aws.
After that it gets murky. Hadoop versions 2.7.x have something going on internally, so that they are not compatible with more recent versions of aws-java-sdk, which is a library required by hadoop-aws. The Internet is full of advice to use version 1.7.4, for example here, but other comments suggest to using version 1.7.14 for 2.7.x.
So I did another run using hadoop-aws:2.7.3 and aws-java-sdk:1.7.x, with x ranging from 4 to 14. No results whatsover, I always end up with error 400, Bad Request.
My Hadoop installation ships joda-time 2.9.4. I read the problem was resolved with Hadoop 2.8. I suppose I will just go ahead and build my own docker containers with more recent versions.
Edit 2
Moved to Hadoop 2.8.3. It just works now. Turns out you don't even have to mess around with JARs at all. Hadoop ships with what are supposed to be working JARs for accessing AWS S3. They are hidden in ${HADOOP_HOME}/share/hadoop/tools/lib and not added to the classpath by default. I simply load the JARS in that directory, execute my code as stated above and now it works.
Mixing and matching AWS SDK JARs with anything else is an exercise in futility, as you've discovered. You need the version of the AWS JARs Hadoop was built with, and the version of Jackson AWS was built with. Oh, and don't try mixing any of (different amazon-* JARs, different hadoop-* JARs, different jackson-* JARs); they all go in lock-sync.
For Spark 2.2.0 and Hadoop 2.7, use AWS 1.7.4 artifacts, and make sure that if you are on Java 8, that Joda time is > 2.8.0, such as 2.9.4. That can lead to 400 "bad auth problems".
Otherwise, try Troubleshooting S3A
Although there are some examples (and questions) on how to submit Spark-Jobs via the YARN-REST-API, there are none, that address the specific changes required to make it work with Spark2. I'm currently basing my work off this example and accompanying documentation, but one thing already is quite clear: Spark2 no longer requires a Spark-assembly jar on HDFS, as far as I can tell. Instead on HDP, there's a spark2-hdp-yarn-archive.tar.gz deployed in HDFS.
Now, I wonder how I would have to configure the local-resources in am-container-spec to make the container Spark2 compatible:
How would I build the classpath for REST-compatbility (in particular __spark.jar__)?
Can I reduce the overall amount of duplicate configuration (as suggested here, for the Java YARN-API)?
I have a streaming job that need to be launched through Zeppelin. However, it is very big project. As far as I know, the program launched on Zeppelin is notebook style, with several hundreds lines of code at most.
My code has several thousands of lines, with many classes and objects. How can I launch such big project on Zeppelin.
For some particular requirement, I have to do this......
The correct and supported way to do this, is to extract an API to your code compile it as a jar library, and use Zeppelin's dependency handling [see interpreter settings] to add the jar of your existing project to Zeppelin. Then you can call the complex methods of your project from within Zeppelin, using compact and notebook-compatible bits of code.
In the Spark 2.0.0's release note, it says that:
Spark 2.0 no longer requires a fat assembly jar for production deployment.
Does this mean that we do not need to make a fat jar anymore for submitting jobs ?
If yes, how ? Thus the documentation here isn't up-to-date.
Does this mean that we do not need to make a fat jar anymore for
submitting jobs ?
Sadly, no. You still have to create an uber JARs for Sparks deployment.
The title from the release notes is very misleading. The actual meaning is that Spark itself as a dependency is no longer compiled into an uber JAR, but acts like a normal application JAR with dependencies. You can see this in more detail # SPARK-11157 which is called "Allow Spark to be built without assemblies", and read the paper called "Replacing the Spark Assembly with good old
jars" which describes the pros and cons of deploying Spark not as several huge JARs (Core, Streaming, SQL, etc..) but as a several relatively regular sized JARs containing the code and a lib/ directory with all the related dependencies.
If you really want the details, this pull request touches several key parts.
I just start learning spark. I have imported spark source code to IDEA and made some small changes (just add some println()) to spark source code. What should I do to see these updates? Should I recompile the spark? Thanks!
At the bare minimum, you will need maven 3.3.3 and Java 7+.
You can follow the steps at http://spark.apache.org/docs/latest/building-spark.html
The "make-distribution.sh" script is quite handy which comes within the spark source code root directory. This script will produce a distributable tar.gz which you can simply extract and launch spark-shell or spark-submit. After making the source code changes in spark, you can run this script with the right options (mainly passing the desired hadoop version, yarn or hive support options but these are required if you want to run on top of hadoop distro, or want to connect to existing hive).
BTW, inserting println() will not be a good idea as it can severely slow down the performance of the job. You should use a logger instead.