Spark LOCAL mode and Alluxio client

I'm running spark in LOCAL mode and trying to get it to talk to alluxio. I'm getting the error:
java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found
I have looked at the page here:
https://www.alluxio.org/docs/master/en/Debugging-Guide.html#q-why-do-i-see-exceptions-like-javalangruntimeexception-javalangclassnotfoundexception-class-alluxiohadoopfilesystem-not-found
which details the steps to take in this situation, but I'm not having any success.
According to the Spark documentation, I can instantiate a local SparkSession like so:
SparkSession.builder
.master("local[*]")
.appName("App")
.getOrCreate
Then I can add the alluxio client library like so:
sparkSession.conf.set("spark.driver.extraClassPath", ALLUXIO_SPARK_CLIENT)
sparkSession.conf.set("spark.executor.extraClassPath", ALLUXIO_SPARK_CLIENT)
I have verified that the proper jar file exists in the right location on my local machine with:
logger.error(sparkSession.conf.get("spark.driver.extraClassPath"))
logger.error(sparkSession.conf.get("spark.executor.extraClassPath"))
But I still get the error. Is there anything else I can do to figure out why Spark is not picking the library up?
Please note I am not using spark-submit - I am aware of the methods for adding the client jar to a spark-submit job. My Spark instance is being created as local within my application and this is the use case I want to solve.
As an FYI there is another application in the cluster which is connecting to my alluxio using the fs client and that all works fine. In that case, though, the fs client is being packaged as part of the application through standard sbt dependencies.
Thanks

In the hopes that this helps someone else:
My problem here was not that the library wasn't getting loaded or wasn't on the classpath, it was that I was using the "fs" version of the client rather than the "hdfs" version.
I had been using a generic 1.4 client; at some point this client was split into an "fs" version and an "hdfs" version. When I recently updated to 1.7, I mistakenly added the "fs" version.
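For anyone hitting the same thing, here is a minimal build.sbt sketch of the distinction, assuming the standard Maven coordinates for the Alluxio 1.7 client artifacts (adjust the version to match your cluster):

// Hadoop-compatible client: provides alluxio.hadoop.FileSystem, which is what Spark needs
libraryDependencies += "org.alluxio" % "alluxio-core-client-hdfs" % "1.7.0"

// The "fs" artifact only exposes Alluxio's native filesystem API and, as far as I can tell,
// does not contain the Hadoop FileSystem implementation:
// libraryDependencies += "org.alluxio" % "alluxio-core-client-fs" % "1.7.0"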

Related

Is there an option to disable specific jars when using spark-submit?

I am running a jar file through spark-submit (YARN, client mode).
When the jar runs, the libraries installed on the classpath of the Spark cluster servers are used.
However, I get an error because of version conflicts between certain libraries.
I don't want to change a component on the server classpath, because it may be used elsewhere.
So the question is: is there an option to disable certain libraries on the Spark cluster servers only when running spark-submit? Or is there any other way to solve my problem?
I look forward to your reply. Thank you.
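One knob that is sometimes suggested for this kind of conflict, sketched here only as an assumption rather than a confirmed fix for this case, is to ask Spark to prefer the jars you submit over the ones on the cluster classpath:

spark-submit \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.driver.userClassPathFirst=true \
  <your usual arguments and application jar>

Both flags are marked experimental in the Spark documentation, and the driver-side flag only takes effect in cluster mode, so in client mode a conflicting driver dependency may still need to be handled differently (for example by shading it into your application jar).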

Unable to access S3 data using Spark 2.2

I get a lot of data uploaded to an S3 bucket that I want to analyze/visualize using Spark and Zeppelin. Yet I am still stuck at loading data from S3.
I did some reading in order to put this together, and I'll spare you the gory details. I am using the docker container p7hb/docker-spark as my Spark installation, and my basic test for reading data from S3 is derived from here:
I start the container and a master and a slave process within. I can validate this works by looking at the Spark Master WebUI, exposed on port 8080. This page does list the worker and keeps a log of all my failed attempts under the headline "Completed Applications". All of those are in the state FINISHED.
I open a bash inside that container and do the following:
a) export the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, as suggested here.
b) start spark-shell. In order to access S3, one seems to need to load some extra packages. Browsing through SE, I found especially this, which teaches me that I can use the --packages parameter to load said packages. Essentially I run spark-shell --packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5 (for arbitrary combinations of versions).
c) I run the following code
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-eu-central-1.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
val sonnets=sc.textFile("s3a://my-bucket/my.file")
val counts = sonnets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
And then I get all kinds of different error messages, depending on the versions I choose in b).
I suppose there is nothing wrong with a), because I get the error message Unable to load AWS credentials from any provider in the chain if I don't supply the credentials. This is a known mistake new users seem to make.
While trying to solve the issue, I pick more or less random versions from here and there for the two extra packages. Somewhere on SE I read that hadoop-aws:2.7 is supposed to be the right choice, because Spark 2.2 is based on Hadoop 2.7. Supposedly one needs to use aws-java-sdk:1.7 with that version of hadoop-aws.
Whatever! I tried the following combinations:
--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1, which yields the common Bad Request 400 error.
Many problems can lead to that error; my attempt as described above contains everything I was able to find on this page. The description above uses s3-eu-central-1.amazonaws.com as the endpoint, while other places use s3.eu-central-1.amazonaws.com. From what I have read, both endpoint names are supposed to work. I did try both.
--packages com.amazonaws:aws-java-sdk:1.7.15,org.apache.hadoop:hadoop-aws:2.7.5, which are the most recent micro versions in either case, I get the error message java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.7.5, I also get java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.1, I get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.8.3, I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.8.12,org.apache.hadoop:hadoop-aws:2.8.3, I also get java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
--packages com.amazonaws:aws-java-sdk:1.11.275,org.apache.hadoop:hadoop-aws:2.9.0, I also get java.lang.NoClassDefFoundError: org/apache/hadoop/fs/StorageStatistics
And, for completeness' sake, when I don't provide the --packages parameter at all, I get java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
Currently nothing seems to work. Yet there are so many Q&As on this topic that who knows what the way du jour of doing this is. This is all in local mode, so there is virtually no other source of error. My method of accessing S3 must be wrong. How is it done correctly?
Edit 1:
So I put another day into this without any actual progress. As far as I can tell, starting from Hadoop 2.6, Hadoop no longer has built-in support for S3; it has to be loaded through additional libraries, which are not part of Hadoop and are maintained separately. Besides all the clutter, the library I ultimately want seems to be hadoop-aws. It has a webpage here and it carries what I would call authoritative information:
The versions of hadoop-common and hadoop-aws must be identical.
The important thing about this information is, that hadoop-common actually does ship with a Hadoop installation. Every Hadoop installation has a corresponding jar file, so this is a solid starting point. My containers have a file /usr/hadoop-2.7.3/share/hadoop/common/hadoop-common-2.7.3.jar so it is fair to assume 2.7.3 is the version I need for hadoop-aws.
After that it gets murky. Hadoop versions 2.7.x have something going on internally that makes them incompatible with more recent versions of aws-java-sdk, the library required by hadoop-aws. The Internet is full of advice to use version 1.7.4, for example here, but other comments suggest using version 1.7.14 for 2.7.x.
So I did another run using hadoop-aws:2.7.3 and aws-java-sdk:1.7.x, with x ranging from 4 to 14. No results whatsoever; I always end up with error 400, Bad Request.
My Hadoop installation ships joda-time 2.9.4. I read the problem was resolved with Hadoop 2.8. I suppose I will just go ahead and build my own docker containers with more recent versions.
Edit 2
Moved to Hadoop 2.8.3. It just works now. It turns out you don't even have to mess around with JARs at all: Hadoop ships with what are supposed to be working JARs for accessing AWS S3. They are hidden in ${HADOOP_HOME}/share/hadoop/tools/lib and not added to the classpath by default. I simply load the JARs from that directory, execute my code as stated above, and now it works.
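For the record, a minimal sketch of what loading those shipped JARs can look like, assuming the stock ${HADOOP_HOME} layout and that the JVM expands the classpath wildcard (my own sketch, not part of the original edit):

spark-shell \
  --driver-class-path "${HADOOP_HOME}/share/hadoop/tools/lib/*" \
  --conf spark.executor.extraClassPath="${HADOOP_HOME}/share/hadoop/tools/lib/*"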
Mixing and matching AWS SDK JARs with anything else is an exercise in futility, as you've discovered. You need the version of the AWS JARs Hadoop was built with, and the version of Jackson AWS was built with. Oh, and don't try mixing different amazon-* JARs, different hadoop-* JARs, or different jackson-* JARs; they all have to move in lockstep.
For Spark 2.2.0 and Hadoop 2.7, use the AWS 1.7.4 artifacts, and make sure that, if you are on Java 8, Joda Time is > 2.8.0, such as 2.9.4. An older Joda Time on Java 8 can lead to 400 "bad auth" problems.
Otherwise, try Troubleshooting S3A
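Concretely, a hedged example of one consistent set of coordinates for that combination, matching the hadoop-common-2.7.3.jar found above (the exact versions are my assumption of a matching set, not taken from the answer):

spark-shell --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4,joda-time:joda-time:2.9.4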

How to install a postgresql JDBC driver in pyspark

I use PySpark with Spark 2.2.0 on Lubuntu 16.04, and I want to write a DataFrame to my PostgreSQL database. As far as I understand it, I have to install a JDBC driver on the Spark master for this. I downloaded the PostgreSQL JDBC driver from their website and tried to follow this post. I added spark.jars.packages /path/to/driver/postgresql-42.2.1.jar to spark-default.conf, with the only result that pyspark no longer launches.
I'm kind of lost in Java land. For one, I don't know if this is the right format. The documentation tells me I should add a list, but I don't know what a path list is supposed to look like. Then I don't know whether I also have to specify spark.jars and/or spark.driver.extraClassPath, or whether spark.jars.packages is enough. And if I do have to add them, what format do they take?
spark.jars.packages is for dependencies that can be pulled from Maven (think of it as pip for Java, although the analogy is probably somewhat loose).
You can submit your job with the option --jars /path/to/driver/postgresql-42.2.1.jar, so that the submission will also provide the library, which the cluster manager will distribute to all worker nodes on your behalf.
If you want to set this as a configuration, you can use the spark.jars key instead of spark.jars.packages. The latter requires Maven coordinates rather than a path (which is probably the reason why your job is failing).
You can read more about the configuration keys I introduced on the official documentation.
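For example, a minimal sketch of the two forms as spark-defaults.conf entries; the Maven coordinates below are my assumption of the artifact matching that driver version:

# local path form
spark.jars /path/to/driver/postgresql-42.2.1.jar

# Maven coordinates form (Spark resolves and downloads the artifact)
spark.jars.packages org.postgresql:postgresql:42.2.1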

What setup is needed to use the Spark Cassandra Connector with Spark Job Server

I am working with Spark and Cassandra, and in general things are straightforward and working as intended; in particular the spark-shell and running .scala processes to get results.
I'm now looking at using the Spark Job Server; I have the Job Server up and running and working as expected for the test items, as well as for some initial, simple .scala programs I developed.
However I now want to take one of the .scala programs that works in spark-shell and get it onto the Spark Job Server to access via that mechanism. The issue I have is that the Job Server doesn't seem to recognise the import statements around cassandra and fails to build (sbt compile; sbt package) a jar for upload to the Job Server.
At some level it just looks like I need the Job Server equivalent to the spark shell package switch (--packages datastax:spark-cassandra-connector:2.0.1-s_2.11) on the Spark Job Server so that import com.datastax.spark.connector._ and similar code in the .scala files will work.
Currently, when I attempt to build (sbt compile), I get messages such as:
[error] /home/SparkCassandraTest.scala:10: object datastax is not a member of package com
[error] import com.datastax.spark.connector._
I have added different items to the build.sbt file based on searches and message-board advice, but with no real change. If that is where the answer lies, what should be added to the build or to the base Job Server to enable use of the Cassandra connector?
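For reference, a build.sbt sketch of the Maven-coordinate equivalent of that --packages switch, assuming Scala 2.11 and connector version 2.0.1 to match the versions above (a sketch, not taken from the answer below):

// Spark Cassandra Connector; equivalent of --packages datastax:spark-cassandra-connector:2.0.1-s_2.11
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.1"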
I think that you need spark-submit to do this. I am also working with Spark and Cassandra, but only for a month, so I've had to read a lot of information. I have compiled this info in a repository; maybe it could help you, although it is an alpha version, sorry about that.

How to get access to HDFS files in Spark standalone cluster mode?

I am trying to get access to HDFS files in Spark. Everything works fine when I run Spark in local mode, i.e.
SparkSession.master("local")
and get access to HDFS files by
hdfs://localhost:9000/$FILE_PATH
But when I am trying to run Spark in standalone cluster mode, i.e.
SparkSession.master("spark://$SPARK_MASTER_HOST:7077")
Error throws
java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.fun$1 of type org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1
So far I have only run
start-dfs.sh
in Hadoop and have not really configured anything in Spark. Do I need to run Spark with the YARN cluster manager instead, so that Spark and Hadoop use the same cluster manager and can therefore access HDFS files?
I have tried to configure yarn-site.xml in Hadoop following the Tutorialspoint guide https://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm and have specified HADOOP_CONF_DIR in spark-env.sh, but it does not seem to work and the same error is thrown. Am I missing some other configuration?
Thanks!
EDIT
The initial Hadoop version is 2.8.0, and the Spark version is 2.1.1 built for Hadoop 2.7. I tried downloading hadoop-2.7.4, but the same error still exists.
The question here suggests this is a Java syntax issue rather than a Spark/HDFS issue. I will try that approach and see if it solves the error here.
Inspired by the post here, I solved the problem myself.
This map-reduce job depends on a Serializable class, so when running in Spark local mode, the class can be found and the map-reduce job executes as expected.
When running in Spark standalone cluster mode, it is best to submit the application through spark-submit rather than running it from an IDE. I packaged everything into a jar and submitted it with spark-submit, and it works like a charm!
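A minimal sketch of that submission, with a hypothetical class name, jar path, and master host (adjust to your own project):

spark-submit \
  --class com.example.MyHdfsApp \
  --master spark://$SPARK_MASTER_HOST:7077 \
  target/scala-2.11/my-hdfs-app.jar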
