I am running an old MapR cluster, mapr3.
How can I build a custom distribution for Spark 1.5.x for mapr3?
From what I understand, getting the right hadoop.version is the key step to making everything work.
I went back to Spark 1.3.1 and found the mapr3 profile, which sets hadoop.version=1.0.3-mapr-3.0.3. To build a complete distribution, the following command works if you already have JAVA_HOME set:
./make-distribution.sh --name custom-spark --tgz -Dhadoop.version=1.0.3-mapr-3.0.3 -Phadoop-1 -DskipTests
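If you want to double-check what a profile pins before building, grepping the top-level pom.xml of the Spark source is usually enough. This is only a rough check; the -A window may need widening to reach the hadoop.version property:
grep -n -A 6 'mapr3' pom.xml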
I am using Dataproc image version 2.0.x in Google Cloud, since Delta 0.7.0 is available in this image version. However, this Dataproc instance comes with PySpark 3.1.1 by default, and Apache Spark 3.1.1 has not been officially released yet, so there is no version of Delta Lake compatible with 3.1; hence the suggestion to downgrade.
I have tried the below,
pip install --force-reinstall pyspark==3.0.1
I executed the above command as the root user on the master node of the Dataproc instance; however, when I check pyspark --version it still shows 3.1.1.
How do I pin the default PySpark version to 3.0.1?
The simplest way to use Spark 3.0 w/ Dataproc 2.0 is to pin an older Dataproc 2.0 image version (2.0.0-RC22-debian10) that used Spark 3.0 before it was upgraded to Spark 3.1 in the newer Dataproc 2.0 image versions:
gcloud dataproc clusters create $CLUSTER_NAME --image-version=2.0.0-RC22-debian10
To use Spark 3.0.1 you need to make sure that the master and worker nodes in the Dataproc cluster have the Spark 3.0.1 jars in /usr/lib/spark/jars instead of the 3.1.1 ones.
There are two ways you could do that:
Move the 3.0.1 jars to /usr/lib/spark/jars manually on each node and remove the 3.1.1 ones. After running pip install for the desired PySpark version, you can find the Spark jars in /.local/lib/python3.8/site-packages/pyspark/jars. Make sure to restart Spark after this: sudo systemctl restart spark*
Alternatively, use Dataproc init actions (https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions?hl=en) to do the same, so you don't have to SSH into each node and change the jars manually.
Steps:
Upload the updated jars to a GCS folder, e.g., gs:///lib-updates, which has the same structure as the /usr/lib/ directory of the cluster nodes.
Write an init actions script that syncs the updates from GCS to the local /usr/lib/ and then restarts the Spark services. Upload the script to GCS, e.g., gs:///init-actions-update-libs.sh.
#!/bin/bash
set -o nounset
set -o errexit
set -o xtrace
set -o pipefail
# The GCS folder of lib updates.
LIB_UPDATES=$(/usr/share/google/get_metadata_value attributes/lib-updates)
# Sync updated libraries from $LIB_UPDATES to /usr/lib
gsutil rsync -r -e "$LIB_UPDATES" /usr/lib/
# Restart Spark services so they pick up the replaced jars
systemctl restart spark*
Create a cluster with --initialization-actions $INIT_ACTIONS_UPDATE_LIBS and --metadata lib-updates=$LIB_UPDATES.
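Putting it together, cluster creation would look roughly like the following; the bucket paths, cluster name, and region are placeholders rather than values from the original setup:
INIT_ACTIONS_UPDATE_LIBS=gs://my-bucket/init-actions-update-libs.sh
LIB_UPDATES=gs://my-bucket/lib-updates
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --initialization-actions=$INIT_ACTIONS_UPDATE_LIBS \
    --metadata=lib-updates=$LIB_UPDATES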
I am trying to setup Spark notebook in HUE(version 3.11) with Spark 2.0.0 using Livy 0.2.0.
With Spark 1.6.1 the notebook is working perfectly fine.
Livy only supports Scala 2.10 builds of Spark, so I built Spark 2.0.0 with Scala 2.10.6. When I open spark-shell (2.0.0), it clearly says "Using Scala version 2.10.6".
But the Spark notebook is not working with this build. When I execute 1+1 in the notebook, it gives the following error.
What could be wrong here? Below is the exception from the logs:
"java.util.concurrent.ExecutionException: com.cloudera.livy.rsc.rpc.RpcException: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.render(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/JsonAST$JValue;\ncom.cloudera.livy.repl.ReplDriver$$anonfun$handle$2.apply(ReplDriver.scala:78)\ncom.cloudera.livy.repl.ReplDriver$$anonfun$handle$2.apply(ReplDriver.scala:78)\nscala.Option.map(Option.scala:145)\ncom.cloudera.livy.repl.ReplDriver.handle(ReplDriver.scala:78)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:606)\ncom.cloudera.livy.rsc.rpc.RpcDispatcher.handleCall(RpcDispatcher.java:130)\ncom.cloudera.livy.rsc.rpc.RpcDispatcher.channelRead0(RpcDispatcher.java:77)\nio.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)\nio.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)\nio.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)\nio.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)\nio.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)\nio.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)\nio.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)\nio.netty.handler.codec.ByteToMessageCodec.channelRead(ByteToMessageCodec.java:103)\nio.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)\nio.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)\nio.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)\nio.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)\nio.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)\nio.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)\nio.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)\nio.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)\nio.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)\njava.lang.Thread.run(Thread.java:745)" (error 500)
This solved my problem.
Download the latest Livy code from GitHub and use the below Maven build command:
mvn clean package -DskipTests -Dspark-2.0 -Dscala-2.11
I'm not sure that this is even possible.
According to the release notes, Hue 3.11 does not work with Spark 2.0 (it works with Spark 1.6).
After various failed attempts to use my Hive (1.2.1) with my Spark (1.4.1 built for Hadoop 2.2.0), I decided to try rebuilding Spark with Hive support.
I would like to know what is the latest Hive version that can be used to build Spark at this point.
When downloading Spark 1.5 source and trying:
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-1.2.1 -Phive-thriftserver -DskipTests clean package
I get:
The requested profile "hive-1.2.1" could not be activated because it does not exist.
Any help appreciated
Check your Spark 1.5 pom.xml: it already contains Hive version 1.2.1, so I don't think you need to specify the Hive version explicitly. Simply run mvn without the Hive version and it should work, as in the command below.
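That is, the command from the question with the nonexistent profile dropped:
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package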
I'd also recommend going through this compatibility chart:
http://hortonworks.com/wp-content/uploads/2016/03/asparagus-chart-hdp24.png
Spark website maintains good docs by version number regarding building with Hive support.
e.g. for v1.5 https://spark.apache.org/docs/1.5.0/building-spark.html
The listed example shows Hadoop 2.4, but as the other answer pointed out, you can leave off -Phive-1.2.1; according to the docs, though, if you do that with Spark 1.5.0 it will build with Hive 0.13 bindings by default.
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
Index of all versions: https://spark.apache.org/docs/
Latest version: https://spark.apache.org/docs/latest/building-spark.html
It appears that Spark defaults to Hive 1.2.1 bindings from version 1.6.2 onwards. The default doesn't necessarily indicate a support limitation, though.
I pulled the latest source from the Spark repository and built locally. It works great from an interactive shell like spark-shell or spark-sql.
Now I want to connect Zeppelin to my Spark 1.5, according to this install manual. I published the custom Spark build to the local maven repository and set the custom Spark version in the Zeppelin build command. The build process finished successfully but when I try to run basic things like sc inside notebook, it throws:
akka.ConfigurationException: Akka JAR version [2.3.11] does not match the provided config version [2.3.4]
Version 2.3.4 is set in pom.xml and spark/pom.xml, but simply changing them won’t even let me get a build.
If I rebuild Zeppelin with the standard -Dspark.version=1.4.1, everything works.
Update 2016-01
Spark 1.6 support has landed to master and is available under -Pspark-1.6 profile.
Update 2015-09
Spark 1.5 support has landed to master and is available under -Pspark-1.5 profile.
Work on supporting Spark 1.5 in Apache Zeppelin (incubating) was done under PR apache/incubator-zeppelin#269, which will land in master soon.
For now, building from Spark_1.5 branch with -Pspark-1.5 should do the trick.
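For example, a build along these lines should pick up the profile; the -Dspark.version value (point it at the version string of your locally published custom build) and the absence of any Hadoop flags are assumptions about your setup rather than the exact command from the install manual:
mvn clean package -Pspark-1.5 -Dspark.version=1.5.0 -DskipTests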
I'm attempting to build Apache Spark 1.1.0 on Windows 8.
I've installed all prerequisites (except Hadoop) and ran sbt/sbt assembly while in the root directory. After downloading many files, I'm getting an error after the line:
"Set current project to root <in build file:C:/.../spark-0.9.0-incubating/>". The error is:
[error] Not a valid command: /
[error] /sbt
[error] ^
How to build Spark on Windows?
NOTE Please see my comment about the differences in versions.
The error Not a valid command: / comes from the sbt that got executed and attempted to interpret / (the first character of the /sbt string) as a command. It can only mean that you have an sbt shell script available on PATH (possibly installed separately, outside the current working directory) or in the current working directory.
Just execute sbt assembly and it should build Spark fine.
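In other words, from the Spark source root in a Windows command prompt (assuming sbt is installed and on PATH):
REM the Unix launcher invocation that triggers the error:
REM   sbt/sbt assembly
REM run the installed sbt directly instead:
sbt assembly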
According to the main page of Spark:
If you’d like to build Spark from scratch, visit building Spark with Maven.
that clearly states that the official build tool for Spark is now Maven (unfortunately).
You should be able to build a Spark package with the following command:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
It worked fine for me.