spark-ec2 and Tachyon hadoop version disparity

spark-ec2 and Tachyon hadoop version disparity - apache-spark

I try to use spark-ec2 to launch ec2 cluster with hadoop version 2.x, so I tried:
./spark-ec2 -k spark -i ~/.ssh/spark.pem -s 1 --hadoop-major-version=2 launch my-spark-cluster
then I found out there are error in the tachyon setting up process:
Setting up tachyon
RSYNC'ing /root/tachyon to slaves...
ec2-52-1-147-16.compute-1.amazonaws.com
ec2-52-1-147-16.compute-1.amazonaws.com: Formatting Tachyon Worker # ip-172-31-21-86.ec2.internal
ec2-52-1-147-16.compute-1.amazonaws.com: Removing local data under folder: /mnt/ramdisk/tachyonworker/
Formatting Tachyon Master # ec2-52-1-14-186.compute-1.amazonaws.com
Formatting JOURNAL_FOLDER: /root/tachyon/libexec/../journal/
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
at tachyon.util.CommonUtils.runtimeException(CommonUtils.java:246)
at tachyon.UnderFileSystemHdfs.<init>(UnderFileSystemHdfs.java:73)
at tachyon.UnderFileSystemHdfs.getClient(UnderFileSystemHdfs.java:53)
at tachyon.UnderFileSystem.get(UnderFileSystem.java:53)
at tachyon.Format.main(Format.java:54)
Caused by: org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy1.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at tachyon.UnderFileSystemHdfs.<init>(UnderFileSystemHdfs.java:69)
... 3 more
I've searched for some related question and it seems that Server IPC version 7 cannot communicate with client version 4 means that server is using hadoop 2.x and client is using hadoop 1.x. However, I built my spark with hadoop 2.4.0 and I also tried the official spark pre-built version with hadoop 2.4.0 and later, both lead to the same error.
By the way, hadoop version created by setting --hadoop-major-version=2 is Hadoop 2.0.0-cdh4.2.0. Is this a problem? But I tried to use 2.4 or 2.4.0 here, neither of them are recognized as valid hadoop version

Related

Apache PySpark - Failed to connect to master 7077

I set up Spark and HDFS after watching this video. The only difference is that I did it on a server (ubuntu) and not on a VM.
On the server, everything works perfect. Now I wanted to access it from my local machine (Windows) with PySpark.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://ubuntu-spark:7077").appName("test").getOrCreate()
spark.stop()
However, here I get the following error messages:
22/11/12 10:38:35 WARN Shell: Did not find winutils.exe: java.io.FileNotFoundException:
java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see
https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
22/11/12 10:38:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
22/11/12 10:38:37 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master
ubuntu-spark:7077
org.apache.spark.SparkException: Exception thrown in awaitResult: ...
According to other posts, the DNS should be correct. I got this from the Spark Master website (at port 8080):
URL: spark://ubuntu-spark:7077
Alive Workers: 1
Cores in use: 2 Total, 0 Used
Memory in use: 6.8 GiB Total, 0.0 B Used
Resources in use:
Applications: 0 Running, 0 Completed
Drivers: 0 Running, 0 Completed
Status: ALIVE
The ports are open. I also don't understand the following message: "HADOOP_HOME and hadoop.home.dir are unset." Hadoop is configured on the server. Why should I do the same thing locally again? My expectation would be that I can use Spark like an API or am I wrong?
Thank you very much for your help. If you need any configuration files I can provide them.

Hadoop should not be necessary for the code shown since you're not using HDFS, but the log is saying it's looking on your Windows machine for those settings.
DNS needs to work between your windows machine and wherever your server is running (a VM can still be server, so it's unclear where you're running this). Start debugging with ping spark-master to check, or you should be able to open spark-master:8080 from Windows browser instance as well.
If you only want to run Spark code, and don't care if it's distributed, you could just use Docker on Windows - https://github.com/jupyter/docker-stacks
Or setup Pycharm all locally for the same

How can I run spark in headless mode in my custom version on HDP?

How can I run spark in headless mode?
Currently, I am executing spark on a HDP 2.6.4 (i.e. 2.2 is installed by default) on the cluster.
I have downloaded a spark 2.4.1 Scala 2.11 release in headless mode (i.e. no hadoop jars are built in) from https://spark.apache.org/downloads.html. The exact name is: pre-built with scala 2.11 and user provided hadoop
Now when trying to run I follow: https://spark.apache.org/docs/latest/hadoop-provided.html
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/home/<<my_user>>/development/software/spark_no_provided_hadoop
./bin/spark-shell --master yarn --deploy-mode client --queue <<my_yarn_queue>>
Unfortunately, it fails to start:
19/05/01 07:12:23 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/05/01 07:12:38 ERROR cluster.YarnClientSchedulerBackend: The YARN application has already ended! It might have been killed or the Application Master may have failed to start. Check the YARN application logs for more details.
19/05/01 07:12:38 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Application application_1555489055691_64276 failed 2 times due to AM Container for appattempt_1555489055691_64276_000002 exited with exitCode: 1
When looking at the logs for details I see:
Log Type: prelaunch.err
launch_container.sh: line 30: $PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:/etc/hadoop/conf:/usr/hdp/2.6.4.0-91/hadoop/*:/usr/hdp/2.6.4.0-91/hadoop/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:/usr/hdp/2.6.4.0-91/hadoop/conf:/usr/hdp/2.6.4.0-91/hadoop/lib/*:/usr/hdp/2.6.4.0-91/hadoop/.//*:/usr/hdp/2.6.4.0-91/hadoop-hdfs/./:/usr/hdp/2.6.4.0-91/hadoop-hdfs/lib/*:/usr/hdp/2.6.4.0-91/hadoop-hdfs/.//*:/usr/hdp/2.6.4.0-91/hadoop-yarn/lib/*:/usr/hdp/2.6.4.0-91/hadoop-yarn/.//*:/usr/hdp/2.6.4.0-91/hadoop-mapreduce/lib/*:/usr/hdp/2.6.4.0-91/hadoop-mapreduce/.//*:/usr/hdp/2.6.4.0-91/tez/*:/usr/hdp/2.6.4.0-91/tez/lib/*:/usr/hdp/2.6.4.0-91/tez/conf:$PWD/__spark_conf__/__hadoop_conf__: bad substitution
So:
/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar: bad substitution
is the cause (and similar to https://community.hortonworks.com/questions/23699/bad-substitution-error-running-spark-on-yarn.html), but this is completely inside Ambari's management domain. How can I work around it to run a more recent version of spark (2.4.x) on the existing 2.6.x HDP plattform?
edit
Assuming I passed a wrong configuration directory for HADOOP_CONF_DIR, it is unset. But then:
When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
so it must be passed. Could it be, that I am passing the wrong value?
According to Exception: java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment. in spark could be correct. For me, no HADOOP_HOME is set by default.
Even when setting to: export HADOOP_CONF_DIR=/usr/hdp/current/spark2-client/conf, the same bad substitution error remains.
NOTE: some interesting steps:
https://community.hortonworks.com/articles/244059/steps-to-install-supplementary-spark-on-hdp-cluste.html, but not for the headless edition
https://community.hortonworks.com/questions/85757/how-to-add-the-hadoop-and-yarn-configuration-file.html

Indeed, https://community.hortonworks.com/questions/23699/bad-substitution-error-running-spark-on-yarn.html is the solution:
cd /usr/hdp
ls
2.6.xxx current share
So for me:
./bin/spark-shell --master yarn --deploy-mode client --queue <<my_queue>>--conf spark.driver.extraJavaOptions='-Dhdp.version=2.6.xxx' --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=2.6.xxx'
works

%spark.r interpreter is not working in Zeppelin 0.6.1

I am having Spark 1.6.2 cluster with Hadoop YARN, Oozie. I have installed Zeppelin 0.6.1(Binary package with all interpreters: zeppelin-0.6.1-bin-all.tgz). When I am trying to use SparkR script with %spark.r interpreter,
%spark.r
# Creating SparkConext and connecting to Cloudant DB
sc1 <- sparkR.init(sparkEnv = list("cloudant.host"="host_name","cloudant.username"="user_name","cloudant.password"="password", "jsonstore.rdd.schemaSampleSize"="-1"))
# Database to be connected to extract the data
database <- "sensordata"
# Creating Spark SQL Context
sqlContext <- sparkRSQL.init(sc)
# Creating DataFrame for the "sensordata" Cloudant DB
sensorDataDF <- read.df(sqlContext, database, header='true', source = "com.cloudant.spark",inferSchema='true')
# Get basic information about the DataFrame(sensorDataDF)
printSchema(sensorDataDF)
I am getting the following error(log):
ERROR [2016-08-25 03:28:37,336] (
{Thread-77}
JobProgressPoller.java[run]:54) - Can not get or update progress
org.apache.zeppelin.interpreter.InterpreterException: org.apache.thrift.transport.TTransportException
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:373)
at org.apache.zeppelin.interpreter.LazyOpenInterpreter.getProgress(LazyOpenInterpreter.java:111)
at org.apache.zeppelin.notebook.Paragraph.progress(Paragraph.java:237)
at org.apache.zeppelin.scheduler.JobProgressPoller.run(JobProgressPoller.java:51)
Caused by: org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_getProgress(RemoteInterpreterService.java:296)
at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.getProgress(RemoteInterpreterService.java:281)
at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.getProgress(RemoteInterpreter.java:370)
... 3 more
Help would be much appreciated.

I faced the same similar issue after migrating to 0.6.1. The issue is Zeppelin is built with scala 2.11 and Apache Spark 1.6.2 is built with scala 2.10.
You need to build spark 1.6.x with scala 2.11 or migrate your spark code to 2.0.0

Setting local[2] in the interpreter section fixed my issues. This was originally suggested by vgunnu
"Try setting spark master as local[2], if that works, you might be missing few environmental variables in env file – vgunnu Aug 25 at 4:37"

Cassandra Java Driver error - All host(s) tried for query failed Connection has been closed

All,
I have a 3 noded cluster cassandra in Digital Ocean . the version of cassandra as per SHOW VERSION in CQL is shown below
[cqlsh 5.0.1 | Cassandra 3.0.0 | CQL spec 3.3.1 | Native protocol v4]
I am able to connnect to one node of the cluster from another node using cqlsh and run commands... However when i try to connect using the java driver , i get the following exception
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /128.xxx.xxx.xx:9042 (com.datastax.driver.core.TransportException: [/128.xxx.xxx.xxx:9042] Connection has been closed))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:222)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:77)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1232)
at com.datastax.driver.core.Cluster.getMetadata(Cluster.java:336)
at com.attinad.cantiz.iot.platform.vehicledatapoc.App.connect(App.java:22)
at com.attinad.cantiz.iot.platform.vehicledatapoc.App.main(App.java:14)
The version of java driver that i am using is 2.0.10. The maven configuration is given below
<dependency>
<groupId>com.datastax.cassandra</groupId>
<artifactId>cassandra-driver-core</artifactId>
<version>2.0.10</version>
</dependency>
I checked the cassandra.yaml and found that the following settings has been set correctly
start_native_transport: true
native_transport_port: 9042
rpc_address: 128.xxx.xxx.xx
listen_address: 128.xxx.xxx.xx
has been configured correctly... So i am completely lost... Any help is appreciated

According to the 2.0.10 driver documentation, that version of the driver is compatible with Apache Cassandra 1.2 and 2.0. Compatibility with 3.0 is added in the 3.0 driver, which is currently at 3.0.0-beta1. The protocol compatibility error should be shown in the Cassandra server logs.
You could either downgrade Cassandra to a 2.x version or try out the beta driver. Downgrading Cassandra should be the safer choice if you want to use the system in production now.

Ran into this myself... had tried Cassandra 3.0.0 with driver 2.1.9.
Fixed it by going to driver 3.0.0.

DSE Spark Shell Authentication

I have a DSE 4.5 installation with spark running. I need some help in passing the username / password of cassandra cluster from Spark Shell.
I have added these properties to conf/spark-default.conf file
spark.cassandra.auth.username=user
spark.cassandra..auth.password=pass
And start up my spark shell using
dse spark
But still seeing the error when I try sc.cassandraTable
com.datastax.driver.core.exceptions.AuthenticationException: Authentication error on host /11.111.11.11:9042: Host /11.111.11.11:9042 requires authentication, but no authenticator found in Cluster configuration
at com.datastax.driver.core.AuthProvider$1.newAuthenticator(AuthProvider.java:38)
at com.datastax.driver.core.Connection.initializeTransport(Connection.java:138)
at com.datastax.driver.core.Connection.<init>(Connection.java:111)
at com.datastax.driver.core.Connection$Factory.open(Connection.java:432)
at com.datastax.driver.core.ControlConnection.tryConnect(ControlConnection.java:216)
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:171)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:79)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1104)

looks like you can execute this command
dse spark -Dcassandra.username=user -Dcassandra.password=pass
ref:
http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/sec/secIntrnlAuth.html?scroll=secItrnlAuth__authentication-for-hadoop-tools

This worked for me:
dse -u cassandra -p cassandra spark

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string