i'm using Spark 2.4.8 with the gcs-connector from com.google.cloud.bigdataoss in version hadoop2-2.1.8. For development i'm using a Compute Engine VM with my IDE. I try to consume some CSV files from a GCS bucket natively with the Spark .csv(...).load(...) functionality. Some files are loaded successfully, but some are not. Then in the Spark UI i can see that the load job runs forever until a timeout fires.
But the weird thing is, that when i run the same application packaged to a Fat-JAR in Dataproc cluster, all the same files can be consumed successfully.
What i am doing wrong?
#JanOels, As you have mentioned in the comment, using gcs-connector in version hadoop2-2.2.8 will resolve this issue and the latest version of hadoop2 is hadoop2-2.2.10.
For more information about all the versions of hadoop2 to use gcs-connector from com.google.cloud.bigdataoss this document can be referred.
Posting the answer as community wiki for the benefit of the community that might encounter this use case in the future.
Feel free to edit this answer for additional information.
I am new to the spark world and to some extent coding.
This question might seem too basic but please clear my confusion.
I know that we have to import spark libraries to write spark application. I use intellij and sbt.
After writing the application , I can also run them and see the output on "run".
My question is, why should I install spark separately on my machine(local) if I can just import them as libraries and run them.
Also what is the need for it to be installed on the cluster since we can just submit the jar file and jvm is already present in all the machines of the clustor
Thank you for the help!
I understand your confusion.
Actually you don't really need to install spark on your machine if you are for example running it on scala/java and you can just import spark-core or any other dependancies into your project and once you start your spark job on mainClass it will create an standalone spark runner on your machine and run your job on if (local[*]).
There are many reasons for having spark on your local machine.
One of them is for running spark job on pyspark which requires spark/python/etc libraries and a runner(local[] or remote[]).
Another reason can be if you want to run your job on-premise.
It might be easier to create cluster on your local datacenter and maybe appoint your machine as master and the other machines connected to your master as worker.(this solution might be abit naive but you asked for basics so this might spark your curiosity to read more about infrastructure design of a data processing system more)
I get a lot of data uploaded to an S3 bucket that I want so analyze/visualize using Spark and Zeppelin. Yet, I am still stuck at loading data from S3.
I did some reading in order to get this together and spare me gory details. I am using the docker container p7hb/docker-spark as Spark installation and my basic test for reading data from S3 is derived from here:
I start the container and a master and a slave process within. I can validate this works by looking at the Spark Master WebUI, exposed on port 8080. This page does list the worker and keeps a log of all my failed attempts under the headline "Completed Applications". All of those are in the state FINISHED.
I open a bash inside that container and do the following:
a) export the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, as suggested here.
b) start spark-shell. In order to access S3 one seems to need to load some extra packages. Browsing through SE I found especially this, which teaches me, that I can use the --packages parameter to load said packages. Essentially I run spark-shell --packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5(, for arbitrary combinations of versions).
c) I run the following code
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-eu-central-1.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
val sonnets=sc.textFile("s3a://my-bucket/my.file")
val counts = sonnets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
And then I get all kinds of different Error messages, depending on the versions I choose in 2b).
I suppose there is nothing wrong with 2a), b/c I get the error message Unable to load AWS credentials from any provider in the chain if I don't supply those. This is a known error new users seem to make.
While trying to solve the issue, I pick more or less random versions from here and there for the two extra packages. Somewhere on SE I read that hadoop-aws:2.7 is supposed to be the right choice, because Spark 2.2 is based on Hadoop 2.7. Supposedly one needs to use aws-java-sdk:1.7 with that version of hadoop-aws.
Whatever! I tried thefollowing combinations
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1, which yields the common Bad Request 400 error.
Many problems can lead to that error, my attempt as described above containseverything I was able to find on this page. The description above contains s3-eu-central-1.amazonaws.com as endpoint, while other places use s3.eu-central-1.amazonaws.com. According to enter link description here, both endpoint names are supposed to work. I did try both.
--packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5, which are the most recent micro versions in either case, I get the error message java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecuto
r;)V
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.7.5, I also get java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.1, I get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.3, I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.8.12,org.apache.hadoop:hadoop-aws:2.8.3, I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.9.0, I also get java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
And, for completeness sake, when I don't provide the --packages parameter, I get java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
Currently nothing seems to work. Yet, there are so many Q/As on this topic, who knows what's the way du jour of doing this. This is all in local mode, so there is virtually no other source of error. My method of accessing S3 must be wrong. How is it done correctly?
Edit 1:
So I put another day into this, without any actual progress. As far as I can tell, starting from Hadoop 2.6, Hadoop doesn't have built in support for S3 anymore, but it as to be loaded through additional libraries, which are not part of Hadoop and entirely managed by itself. Besides all the clutter, the library I ultimately want seems to be hadoop-aws. It has a webpage here andit carries what I would call authoritative information:
The versions of hadoop-common and hadoop-aws must be identical.
The important thing about this information is, that hadoop-common actually does ship with a Hadoop installation. Every Hadoop installation has a corresponding jar file, so this is a solid starting point. My containers have a file /usr/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar so it is fair to assume 2.7.3 is the version I need for hadoop-aws.
After that it gets murky. Hadoop versions 2.7.x have something going on internally, so that they are not compatible with more recent versions of aws-java-sdk, which is a library required by hadoop-aws. The Internet is full of advice to use version 1.7.4, for example here, but other comments suggest to using version 1.7.14 for 2.7.x.
So I did another run using hadoop-aws:2.7.3 and aws-java-sdk:1.7.x, with x ranging from 4 to 14. No results whatsover, I always end up with error 400, Bad Request.
My Hadoop installation ships joda-time 2.9.4. I read the problem was resolved with Hadoop 2.8. I suppose I will just go ahead and build my own docker containers with more recent versions.
Edit 2
Moved to Hadoop 2.8.3. It just works now. Turns out you don't even have to mess around with JARs at all. Hadoop ships with what are supposed to be working JARs for accessing AWS S3. They are hidden in ${HADOOP_HOME}/share/hadoop/tools/lib and not added to the classpath by default. I simply load the JARS in that directory, execute my code as stated above and now it works.
Mixing and matching AWS SDK JARs with anything else is an exercise in futility, as you've discovered. You need the version of the AWS JARs Hadoop was built with, and the version of Jackson AWS was built with. Oh, and don't try mixing any of (different amazon-* JARs, different hadoop-* JARs, different jackson-* JARs); they all go in lock-sync.
For Spark 2.2.0 and Hadoop 2.7, use AWS 1.7.4 artifacts, and make sure that if you are on Java 8, that Joda time is > 2.8.0, such as 2.9.4. That can lead to 400 "bad auth problems".
Otherwise, try Troubleshooting S3A
I have a streaming job that need to be launched through Zeppelin. However, it is very big project. As far as I know, the program launched on Zeppelin is notebook style, with several hundreds lines of code at most.
My code has several thousands of lines, with many classes and objects. How can I launch such big project on Zeppelin.
For some particular requirement, I have to do this......
The correct and supported way to do this, is to extract an API to your code compile it as a jar library, and use Zeppelin's dependency handling [see interpreter settings] to add the jar of your existing project to Zeppelin. Then you can call the complex methods of your project from within Zeppelin, using compact and notebook-compatible bits of code.
I just start learning spark. I have imported spark source code to IDEA and made some small changes (just add some println()) to spark source code. What should I do to see these updates? Should I recompile the spark? Thanks!
At the bare minimum, you will need maven 3.3.3 and Java 7+.
You can follow the steps at http://spark.apache.org/docs/latest/building-spark.html
The "make-distribution.sh" script is quite handy which comes within the spark source code root directory. This script will produce a distributable tar.gz which you can simply extract and launch spark-shell or spark-submit. After making the source code changes in spark, you can run this script with the right options (mainly passing the desired hadoop version, yarn or hive support options but these are required if you want to run on top of hadoop distro, or want to connect to existing hive).
BTW, inserting println() will not be a good idea as it can severely slow down the performance of the job. You should use a logger instead.