Unable to connect to spark-sql cli - apache-spark

I am using the CDH 5.5.7 QuickStart VM, which has Spark 1.6.0 running. I am trying to connect to the spark-sql CLI, but it fails.
According to this link, issuing the spark-sql command should open the CLI, but instead I get the error below.
[cloudera@quickstart ~]$ spark-sql
-bash: spark-sql: command not found
I have also tried the following and get a similar error:
[cloudera@quickstart ~]$ ./bin/spark-sql
-bash: ./bin/spark-sql: No such file or directory
Any help is much appreciated.

This probably will not work in Cloudera's distribution of Spark.
I think they stopped shipping spark-sql as of CDH 5.4.
spark-sql may have been left out because CDH Spark does not include the Thrift server, or for some other reason.
I can't find confirmation in the online documentation, but my CDH 5.8 has spark-sql in neither the Spark 1.6 nor the Spark 2.0 parcel.
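As a possible workaround (a minimal sketch, assuming the quickstart VM's pyspark shell and Hive metastore are set up as usual), the same Hive SQL can be run from the pyspark shell that CDH does ship:

from pyspark.sql import HiveContext

# inside the pyspark shell, `sc` (the SparkContext) already exists
sqlContext = HiveContext(sc)             # Hive-aware SQLContext in Spark 1.6
sqlContext.sql("SHOW TABLES").show()     # run any Hive SQL statement here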

Related

cql-import tool not present in sqoop 1.4.6

I am currently stuck on data migration: I want to migrate data from an Oracle database to Cassandra.
I have the following tools installed on Linux:
DSE 4.8
Hadoop 2.7.3
Sqoop 1.4.6
I am not sure why my Sqoop installation does not have cql-import or any other Cassandra-related commands.
These are the available commands I can see in the sqoop help output:
Available commands:
codegen
create-hive-table
eval
export
help
import
import-all-tables
import-mainframe
job
list-databases
list-tables
merge
metastore
version
I have searched all over the net and found the following links to the latest Sqoop releases, but the cql-import tool is missing from all of them:
https://www-eu.apache.org/dist/sqoop/
http://mirrors.ibiblio.org/apache/sqoop/1.4.6/
It would be very helpful if anyone has a link to a Sqoop version that supports Cassandra data migration commands like "cql-import".
Edited:
One more point to add: I have manually configured Hadoop and Sqoop.
Thanks in advance

Spark notebook in Hue 3.11

I am trying to set up the Spark notebook in Hue (version 3.11) with Spark 2.0.0 using Livy 0.2.0.
With Spark 1.6.1 the notebook is working perfectly fine.
Livy only supports Scala 2.10 builds of Spark, so I built Spark 2.0.0 with Scala 2.10.6. When I open spark-shell (2.0.0), it clearly says "Using Scala version 2.10.6".
But the Spark notebook does not work with this build. When I execute 1+1 in the notebook, it gives the following error.
What could be wrong here? Below is the exception from the logs:
"java.util.concurrent.ExecutionException: com.cloudera.livy.rsc.rpc.RpcException: java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.render(Lorg/json4s/JsonAST$JValue;)Lorg/json4s/JsonAST$JValue;\ncom.cloudera.livy.repl.ReplDriver$$anonfun$handle$2.apply(ReplDriver.scala:78)\ncom.cloudera.livy.repl.ReplDriver$$anonfun$handle$2.apply(ReplDriver.scala:78)\nscala.Option.map(Option.scala:145)\ncom.cloudera.livy.repl.ReplDriver.handle(ReplDriver.scala:78)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:606)\ncom.cloudera.livy.rsc.rpc.RpcDispatcher.handleCall(RpcDispatcher.java:130)\ncom.cloudera.livy.rsc.rpc.RpcDispatcher.channelRead0(RpcDispatcher.java:77)\nio.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)\nio.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)\nio.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)\nio.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)\nio.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)\nio.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)\nio.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)\nio.netty.handler.codec.ByteToMessageCodec.channelRead(ByteToMessageCodec.java:103)\nio.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)\nio.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)\nio.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)\nio.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)\nio.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)\nio.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)\nio.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)\nio.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)\nio.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)\njava.lang.Thread.run(Thread.java:745)" (error 500)
This solved my problem.
Download the latest Livy code from GitHub and use the Maven build command below:
mvn clean package -DskipTests -Dspark-2.0 -Dscala-2.11
I'm not sure this is even possible.
According to the release notes, Hue 3.11 does not work with Spark 2.0 (it works with Spark 1.6).

Failed to find data source: com.stratio.datasource.mongodb

I have read through the issues for the other Stratio packages, but I couldn't solve my problem.
The error occurs when I try:
./bin/spark-submit --packages com.stratio.datasource:spark-mongodb_2.11:0.12.0
or:
./bin/spark-submit --jars /home/user/Spark-MongoDB/spark-mongodb_2.11/target/spark-mongodb_2.11-0.12.1-RC1-SNAPSHOT.jar
I've been struggling with this for the past two days; what am I doing wrong? I'm using Spark 2.0.0 on Ubuntu 14.04.
This is the command I am using on my test machine and it is working fine.
spark-submit --packages com.stratio.datasource:spark-mongodb_2.11:0.12.0 --master local[1] Cell.py
I have the same environment, i.e. Ubuntu 14.04, Scala 2.11, and Spark 2.0.0.
I am writing data to MongoDB from my Python program and it works as expected.
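For reference, the read side with the Stratio datasource in PySpark looks roughly like the sketch below; the host, database, and collection values are placeholders, and the --packages option shown above still has to be passed to spark-submit or pyspark.

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = (sqlContext.read
      .format("com.stratio.datasource.mongodb")               # Stratio MongoDB data source
      .options(host="localhost:27017",                        # placeholder MongoDB host
               database="testdb", collection="testcoll")      # placeholder names
      .load())
df.show()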

Running R on amazon EMR with spark 1.6 and Zeppelin 0.5.6

I am trying to set up the R interpreter to run in Zeppelin, which is currently running on EMR. Zeppelin is working perfectly and I am able to write scripts in Scala and Python. When I use %r, %sparkR, or %knitr, I receive an error: "r interpreter not found".
The applications I have running in my emr-4.7.2 cluster are: Hive 1.0.0, Zeppelin-Sandbox 0.5.6, Spark 1.6.2, and Pig 0.14.0.
Within the interpreter settings there is no mention of R, so I figure I am missing something, but I do not know what.
Any pointers greatly appreciated.
Zeppelin on Amazon EMR (up to at least emr-5.0.0) does not support the SparkR interpreter.
See the Elastic MapReduce Release Guide's Zeppelin documentation for more information.

Connecting SparkR to the spark cluster

I have a Spark cluster running on 10 machines (1-10), with the master on machine 1. All of them run CentOS 6.4.
I am trying to connect a JupyterHub installation (running inside an Ubuntu Docker container because of issues installing it on CentOS) to the cluster via SparkR and get a Spark context.
The code I am using is:
Sys.setenv(SPARK_HOME="/usr/local/spark-1.4.1-bin-hadoop2.4")
library(SparkR)
sc <- sparkR.init(master="spark://<master-ip>:7077")
The output I get is:
Attaching package: ‘SparkR’
The following object is masked from ‘package:stats’:
filter
The following objects are masked from ‘package:base’:
intersect, sample, table
Launching java with spark-submit command spark-submit sparkr-shell /tmp/Rtmpzo6esw/backend_port29e74b83c7b3
Error in sparkR.init(master = "spark://10.10.5.51:7077"): JVM is not ready after 10 seconds
Error in sparkRSQL.init(sc): object 'sc' not found
I am using Spark 1.4.1. The Spark cluster is also running CDH 5.
The JupyterHub installation can connect to the cluster via PySpark, and I have Python notebooks that use PySpark.
Can someone tell me what I am doing wrong?
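(For comparison, the working PySpark connection mentioned above typically amounts to something like the following sketch; the master URL and application name are placeholders.)

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("spark://<master-ip>:7077").setAppName("jupyter-test")
sc = SparkContext(conf=conf)    # succeeds where sparkR.init() times out
print(sc.version)               # sanity check that the context is up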
I have a similar problem and have been searching all around, but found no solutions. Can you please tell me what you mean by "jupyterhub installation (which is running inside a ubuntu docker because of issues with installing on CentOS)"?
We also have 4 clusters on CentOS 6.4. Another problem of mine is how to use an IDE like IPython or RStudio to interact with these 4 servers. Do I connect to them remotely from my laptop (if yes, then how?), and if not, what could be the other solution?
Now, to answer your question, I can give it a try. I think you have to use the --yarn-cluster option, as stated here. I hope this helps you solve the problem.
Cheers,
Ashish
