Elasticsearch for spark 3.0 - apache-spark

I'm running into issues when using Spark 3.0 to read from Elasticsearch.
My Elasticsearch version is 7.6.0.
I used the Elasticsearch jar of the same version.
Please suggest a solution.

Spark 3.0.0 relies on Scala 2.12, which is not yet supported by elasticsearch-hadoop. This and a few further issues prevent us from using Spark 3.0.0 together with Elasticsearch. If you want to compile it yourself, there is a pull request on elasticsearch-hadoop (https://github.com/elastic/elasticsearch-hadoop/pull/1308) which should at least allow using Scala 2.12. I am not sure whether it fixes the other issues as well.

Support for Spark 3.0 has now been officially released (elasticsearch-hadoop 7.12.0).
Enhancements:
https://www.elastic.co/guide/en/elasticsearch/hadoop/7.12/eshadoop-7.12.0.html
Maven Repository:
https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-30_2.12/7.12.0
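
One way to pull the published artifact in is Spark's --packages option. The following is only a minimal sketch, not an official recipe: the node address (localhost, port 9200) is an assumed placeholder, and the command is printed rather than executed so it can be reviewed first.

```shell
# Maven coordinate taken from the repository link above.
ES_SPARK_PKG="org.elasticsearch:elasticsearch-spark-30_2.12:7.12.0"

# Assumed placeholders: adjust to your own Elasticsearch node.
ES_NODES="localhost"
ES_PORT="9200"

# es-hadoop settings can be passed through Spark by prefixing them with "spark.".
LAUNCH_CMD="spark-shell --packages ${ES_SPARK_PKG} --conf spark.es.nodes=${ES_NODES} --conf spark.es.port=${ES_PORT}"

# Print the command instead of running it, so it can be checked first.
echo "${LAUNCH_CMD}"
```

The same coordinate should also work with spark-submit and pyspark, since all three accept --packages.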

It is not official for now, but you can compile the dependency yourself from
https://github.com/elastic/elasticsearch-hadoop. The steps are:
git clone https://github.com/elastic/elasticsearch-hadoop.git
cd elasticsearch-hadoop/
vim ~/.bashrc   # add the export line below, then save
export JAVA8_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
source ~/.bashrc
./gradlew elasticsearch-spark-30:distribution --console=plain
Finally, you can find the .jar package in the folder "elasticsearch-hadoop/spark/sql-30/build/distributions"; elasticsearch-spark-30_2.12-8.0.0-SNAPSHOT.jar is the Elasticsearch Spark package.

Related

Apache Zeppelin 0.7.3 Helium volume-leaflet

I am using Apache Zeppelin 0.7.3 and would like to use the volume-leaflet visualization.
volume leaflet npm package info
The above npm package info states at the bottom of the page:
Compatibility
Requires Zeppelin 0.8.0-SNAPSHOT+
So the npm package apparently requires Zeppelin 0.8.0 but I can find no information on Zeppelin's web page on how to download/install 0.8. The latest available version of Zeppelin is 0.7.3. What am I missing here?
And yes, I have tried volume-leaflet with 0.7.3 but had some challenges.
Thanks in advance for any feedback.
Zeppelin 0.8 is still in development. The active documentation can be found here: https://zeppelin.apache.org/docs/0.8.0-SNAPSHOT/. I am not aware of any nightly builds, so you will need to build Zeppelin on your own; see How to build.
However, some of the Helium plugins work with older Zeppelin versions, even if they claim not to. You can try this by adding the package specification to the helium.json. I explained that at a conference recently.

Hive version compatibility with Spark

After various failed attempts to use my Hive (1.2.1) with my Spark (Spark 1.4.1 built for Hadoop 2.2.0), I decided to rebuild Spark with Hive support.
I would like to know what is the latest Hive version that can be used to build Spark at this point.
When downloading Spark 1.5 source and trying:
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-1.2.1 -Phive-thriftserver -DskipTests clean package
I get:
The requested profile "hive-1.2.1" could not be activated because it does not exist.
Any help appreciated
Check your Spark 1.5 pom.xml: it already contains Hive version 1.2.1, so I don't think you need to specify the Hive version explicitly. Simply run mvn without the Hive version profile and it should work.
I'd recommend going through this compatibility chart:
http://hortonworks.com/wp-content/uploads/2016/03/asparagus-chart-hdp24.png
The Spark website maintains good docs, by version number, on building with Hive support,
e.g. for v1.5: https://spark.apache.org/docs/1.5.0/building-spark.html
The listed example uses Hadoop 2.4. As the other answer pointed out, you can leave off -Phive-1.2.1, but according to the docs, doing that with Spark 1.5.0 builds with Hive 0.13 bindings by default.
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
Index of all versions: https://spark.apache.org/docs/
Latest version: https://spark.apache.org/docs/latest/building-spark.html
It appears that it defaults to Hive 1.2.1 bindings from Spark version 1.6.2 onwards. A default doesn't necessarily indicate a support limitation, though.
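
The "profile does not exist" error from the question can be checked for before starting a long build. Below is a rough sketch: has_profile is a hypothetical helper, and it matches any <id> element, so it can report false positives for repository or plugin ids. The demo runs against a small stand-in pom rather than the real Spark pom.xml. Maven's own `mvn help:all-profiles` gives the more reliable listing.

```shell
# has_profile POM ID: succeed if POM contains an <id>ID</id> element.
# Hypothetical helper; it does not check that the <id> sits inside <profile>.
has_profile() {
    grep -qF "<id>$2</id>" "$1"
}

# Demo on a tiny stand-in pom so the sketch runs anywhere.
DEMO_POM="$(mktemp)"
cat > "$DEMO_POM" <<'EOF'
<project>
  <profiles>
    <profile><id>hive</id></profile>
    <profile><id>hive-thriftserver</id></profile>
  </profiles>
</project>
EOF

# Report which of the profiles from the question's command are defined.
for p in hive hive-thriftserver hive-1.2.1; do
    if has_profile "$DEMO_POM" "$p"; then
        echo "profile $p: present"
    else
        echo "profile $p: missing"
    fi
done
```

Against a real checkout, the first argument would be the pom.xml in the Spark source root.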

How to upgrade Apache Spark version

I currently have Spark 1.5.0 installed on AWS using the spark-ec2.sh script.
Now I want to upgrade Spark to 1.5.1. How do I do this? Is there an upgrade procedure, or do I have to build it from scratch using the spark-ec2 script? In that case I would lose all my existing configuration.
Please advise.
Thanks
1.5.1 has the same configuration fields as 1.5.0. I am not aware of any automation tools, but the upgrade should be trivial: copying $SPARK_HOME/conf from the old install should suffice. Back up the old files nevertheless.
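
The back-up-and-copy step could look roughly like this. It is a sketch under assumptions: OLD_SPARK_HOME and NEW_SPARK_HOME are hypothetical variables for the 1.5.0 and 1.5.1 install directories, and temporary directories stand in for them here so the script can be tried safely.

```shell
# Stand-ins for the real install directories (assumption: replace these
# with the actual 1.5.0 and 1.5.1 locations on the cluster).
OLD_SPARK_HOME="$(mktemp -d)"
NEW_SPARK_HOME="$(mktemp -d)"
mkdir -p "$OLD_SPARK_HOME/conf" "$NEW_SPARK_HOME/conf"
echo "spark.executor.memory 2g" > "$OLD_SPARK_HOME/conf/spark-defaults.conf"

# 1. Back up the old configuration before touching anything.
BACKUP_DIR="$OLD_SPARK_HOME/conf.bak"
cp -R "$OLD_SPARK_HOME/conf" "$BACKUP_DIR"

# 2. Copy the old configuration files into the new install.
cp -R "$OLD_SPARK_HOME/conf/." "$NEW_SPARK_HOME/conf/"

# Show what ended up in the new conf directory.
ls "$NEW_SPARK_HOME/conf"
```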

How to connect Zeppelin to Spark 1.5 built from the sources?

I pulled the latest source from the Spark repository and built locally. It works great from an interactive shell like spark-shell or spark-sql.
Now I want to connect Zeppelin to my Spark 1.5, according to this install manual. I published the custom Spark build to the local maven repository and set the custom Spark version in the Zeppelin build command. The build process finished successfully but when I try to run basic things like sc inside notebook, it throws:
akka.ConfigurationException: Akka JAR version [2.3.11] does not match the provided config version [2.3.4]
Version 2.3.4 is set in pom.xml and spark/pom.xml, but simply changing them won’t even let me get a build.
If I rebuild Zeppelin with the standard -Dspark.version=1.4.1, everything works.
Update 2016-01
Spark 1.6 support has landed in master and is available under the -Pspark-1.6 profile.
Update 2015-09
Spark 1.5 support has landed in master and is available under the -Pspark-1.5 profile.
Work on supporting Spark 1.5 in Apache Zeppelin (incubating) was done under the PR apache/incubator-zeppelin#269, which will land in master soon.
For now, building from the Spark_1.5 branch with -Pspark-1.5 should do the trick.

pom file java version spec for Maven

I am new to Maven and am trying to use it to build Apache Spark on Amazon EC2 VMs. I have manually installed Java version 1.7.0 on the VMs. However, when I ran Maven, the following error occurred:
Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first) on project spark-core_2.10: Execution scala-test-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile failed. CompileFailed
I suspect a Java version mismatch is causing the compilation problem. I opened the Spark pom file, and it declares Java-related versions in two separate places:
<java.version>1.6</java.version>
and
<aws.java.sdk.version>1.8.3</aws.java.sdk.version>
What are the differences between these two versions?
Which one should be edited to solve the Java version mismatch?
These are two different things.
<java.version>1.6</java.version>
is the java version used and
<aws.java.sdk.version>1.8.3</aws.java.sdk.version>
is the AWS SDK for Java version used.
The minimum requirement of AWS SDK 1.9 is Java 1.6+, so there are no compatibility issues.
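
To see that the two properties are independent, each can be read from the pom separately. This is a minimal sketch: pom_property is a hypothetical sed-based helper that assumes the property sits on a single line, and the demo pom contains only the two lines quoted in the question.

```shell
# pom_property POM NAME: print the value of <NAME>...</NAME> if it is on one line.
# Hypothetical helper; a real pom may need a proper XML tool instead.
pom_property() {
    sed -n "s:.*<$2>\(.*\)</$2>.*:\1:p" "$1"
}

# Stand-in pom holding just the two properties from the question.
DEMO_POM="$(mktemp)"
cat > "$DEMO_POM" <<'EOF'
<properties>
  <java.version>1.6</java.version>
  <aws.java.sdk.version>1.8.3</aws.java.sdk.version>
</properties>
EOF

JAVA_LEVEL="$(pom_property "$DEMO_POM" java.version)"
AWS_SDK="$(pom_property "$DEMO_POM" aws.java.sdk.version)"
echo "Java version used by the build: $JAVA_LEVEL"
echo "AWS SDK for Java version:       $AWS_SDK"
```

Running `mvn -version` alongside this shows which JDK Maven itself is using, which is the value to compare against java.version.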
