Getting CSV sink metrics files from spark-submit at run time

With metrics.properties in /conf enabling the CSV sink (see the configuration below), metrics are collected every time I submit a job with spark-submit, and they are saved to /tmp/ as expected.
# Enable CsvSink for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# Polling period for CsvSink
*.sink.csv.period=1
*.sink.csv.unit=minutes
# Polling directory for CsvSink
*.sink.csv.directory=/tmp/
# Worker instance overlap polling period
worker.sink.csv.period=1
worker.sink.csv.unit=minutes
Now I want to supply the metrics.properties file at run time (using the same configuration as above), so I passed the following arguments to spark-submit:
$spark_home/bin/spark-submit --files=file:///home/log_properties/metrics.properties --conf spark.metrics.conf=./metrics.properties --class com.myClass job1.jar
I get the following warning, even though I don't have any Graphite configuration in my metrics.properties file (I just copied metrics.properties.template and enabled only the CSV settings above):
WARN graphite.GraphiteReporter: Unable to report to Graphite
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:345)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at java.net.Socket.connect(Socket.java:538)
at java.net.Socket.<init>(Socket.java:434)
at java.net.Socket.<init>(Socket.java:244)
at javax.net.DefaultSocketFactory.createSocket(SocketFactory.java:277)
at com.codahale.metrics.graphite.Graphite.connect(Graphite.java:118)
at com.codahale.metrics.graphite.GraphiteReporter.report(GraphiteReporter.java:167)
at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162)
at org.apache.spark.metrics.sink.GraphiteSink.report(GraphiteSink.scala:91)
at org.apache.spark.metrics.MetricsSystem$$anonfun$report$1.apply(MetricsSystem.scala:114)
at org.apache.spark.metrics.MetricsSystem$$anonfun$report$1.apply(MetricsSystem.scala:114)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.metrics.MetricsSystem.report(MetricsSystem.scala:114)
at org.apache.spark.SparkContext$$anonfun$stop$3.apply$mcV$sp(SparkContext.scala:1715)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1219)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1714)
at org.apache.spark.SparkContext$$anonfun$3.apply$mcV$sp(SparkContext.scala:596)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:267)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:239)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:239)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:239)
Does it default to reporting to Graphite and ignore my metrics.properties (which enables only the CSV sink)?

Pass the setting as a Java system property, -Dspark.metrics.conf=metrics.properties, rather than via --conf spark.metrics.conf=./metrics.properties.
That's the reason why, even though your file is added, it is not used for the metrics config; the default metrics.properties is used instead.

Yeah, I realized I had a metrics.properties file locally (in the directory where I run spark-submit), but the one I passed, i.e. --files=file:///home/log_properties/metrics.properties in the spark-submit, doesn't... I resolved the issue by updating the local file (removing the Graphite flags), but I am still puzzled about why it should care about the local metrics.properties when I have already passed the metrics.properties that I want to use for my job.
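Not from the thread, but when it is unclear which metrics.properties was actually picked up, it can help to list the sinks each candidate file enables. The following is a minimal sketch under stated assumptions: the function name enabled_sinks and the SAMPLE text are made up for illustration, and the parser only understands simple key=value lines with # comments, not the full Java properties syntax.

```python
import re

# Matches keys such as "*.sink.csv.class" or "worker.sink.graphite.class".
SINK_CLASS_KEY = re.compile(r"^([^.]+)\.sink\.([^.]+)\.class$")


def enabled_sinks(properties_text):
    """Return {(instance, sink_name): class_name} for every uncommented sink.

    Hypothetical helper: handles only "key=value" lines and "#" comments.
    """
    sinks = {}
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        match = SINK_CLASS_KEY.match(key.strip())
        if match:
            sinks[(match.group(1), match.group(2))] = value.strip()
    return sinks


# Illustrative content only; the Graphite line is commented out on purpose.
SAMPLE = """\
# Enable CsvSink for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=1
#*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
"""

if __name__ == "__main__":
    for (instance, name), cls in sorted(enabled_sinks(SAMPLE).items()):
        print(f"{instance}: {name} -> {cls}")
```

Running it against both the file passed with --files and the one in the submit directory would show which of them still has a Graphite sink enabled.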

Related

Pyspark - spark-submit to an AWS EMR

I have created an EMR cluster (emr-5.36.0) in AWS with the default Spark components (Spark 2.4.8, Hive 2.3.9).
I have installed PySpark (3.3.0) on an EC2 instance, in a Python virtual environment.
From there, I would like to run "spark-submit" commands against the EMR cluster.
To test the command, I am using the Python code at the bottom of this page.
To configure the YARN_CONF_DIR environment variable on the EC2 instance, I copied the yarn-site.xml file from /etc/hadoop/conf.empty/ on the EMR's master node to a folder on the EC2 instance.
But now, on the EC2, when I try to run spark-submit, I get:
$ export YARN_CONF_DIR=/home/me/spark/
$ spark-submit --master yarn --deploy-mode cluster spark_test.py
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/javax/ws/rs/core/NoContentException
at org.apache.hadoop.yarn.util.timeline.TimelineUtils.<clinit>(TimelineUtils.java:60)
at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:200)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:191)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1327)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1764)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.javax.ws.rs.core.NoContentException
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 13 more
22/07/18 18:36:25 INFO ShutdownHookManager: Shutdown hook called
And from here I am basically lost. I tried to google the error but I am still not clear what the error is about. Did I miss a step? An environment variable maybe?
Ultimately, I want to use the SparkSubmitOperator in Airflow, but I figured I should get the "native" command to work first before using the operator (which is just a wrapper).
If you set YARN_CONF_DIR=/etc/hadoop_files/ locally, the hadoop_files folder needs to contain the contents of the EMR's /etc/hadoop/ folder, not of /etc/hadoop/conf.empty/.
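As a quick sanity check before running spark-submit, one could verify that the directory YARN_CONF_DIR points to holds more than a single copied file. This is only a sketch: the helper name missing_conf_files is hypothetical, and the choice of core-site.xml and yarn-site.xml as the files to look for is an assumption about a typical Hadoop client configuration, not something the answer above states.

```python
import os
from pathlib import Path

# Assumption: these are the client-side files a YARN submission usually reads.
EXPECTED_FILES = ("core-site.xml", "yarn-site.xml")


def missing_conf_files(conf_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from conf_dir."""
    directory = Path(conf_dir)
    return [name for name in expected if not (directory / name).is_file()]


if __name__ == "__main__":
    # Falls back to the current directory when YARN_CONF_DIR is not set.
    conf_dir = os.environ.get("YARN_CONF_DIR", ".")
    missing = missing_conf_files(conf_dir)
    if missing:
        print(f"{conf_dir}: missing {', '.join(missing)}")
    else:
        print(f"{conf_dir}: expected files are present")
```

An empty result only says the files exist; it does not confirm that they came from the right folder on the EMR master node.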

Starting up Spark History Server to write to minIO

I'm trying to get Spark History Server to run on my cluster that is running on Kubernetes, and I'd like the logs to get written to minIO. I'm also using minIO as storage of the input and output of my spark-submit jobs, which is working already.
Currently working spark-submit jobs
My working spark-submit job looks something like the following:
spark-submit \
--conf spark.hadoop.fs.s3a.access.key=XXXX \
--conf spark.hadoop.fs.s3a.secret.key=XXXX \
--conf spark.hadoop.fs.s3a.endpoint=https://someIpv4 \
--conf spark.hadoop.fs.s3a.connection.ssl.enabled=true \
--conf spark.hadoop.fs.s3a.path.style.access=true \
--conf spark.hadoop.fs.default.name="s3a:///" \
--conf spark.driver.extraJavaOptions="-Djavax.net.ssl.trustStore=XXXX -Djavax.net.ssl.trustStorePassword=XXXX" \
--conf spark.executor.extraJavaOptions="-Djavax.net.ssl.trustStore=XXXX -Djavax.net.ssl.trustStorePassword=XXXX" \
...
As you can see, I'm using SSL to connect to minIO and to read/write files.
What I am trying
I'm trying to spin up the history server with minIO as storage without using SSL.
To start up the history server, I'm using the already present start-history-server.sh script with some configs to define the log storage location with the ./start-history-server.sh --properties-file my_conf_file command. my_conf_file looks like this:
spark.eventLog.enabled=true
spark.eventLog.dir=s3a://myBucket/spark-events
spark.history.fs.logDirectory=s3a://myBucket/spark-events
spark.hadoop.fs.s3a.access.key=XXXX
spark.hadoop.fs.s3a.secret.key=XXXX
spark.hadoop.fs.s3a.endpoint=http://someIpv4
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.connection.ssl.enabled=false
So you see I'm not adding any SSL parameters. But when I run ./start-history-server.sh --properties-file my_conf_file, I'm getting this error:
INFO AmazonHttpClient: Unable to execute HTTP request: Connection refused (Connection refused)
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:607)
at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:121)
at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:326)
at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:117)
at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:86)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:296)
at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
What have I tried/found on the internet
This person had a very similar problem to mine, but it seems like they solved it using spark.hadoop.fs.s3a.path.style.access, which I'm already using
I was able to spin up History server using the local filesystem, so that seems to be working correctly
I have seen people, like in this post, using the spark.hadoop.fs.s3a.impl key with org.apache.hadoop.fs.s3a.S3AFileSystem as its value. When I do this, however, it seems like this class doesn't exist within my AWS jars.
I have the following AWS jars at my disposal: aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.3.jar
Since my spark-submit jobs are running fine, reading/writing away files to minIO, and I'm not supplying that spark.hadoop.fs.s3a.impl parameter in them I would think that that parameter is not needed?
Does anyone have an idea of where I should be looking/what I'm doing wrong?
My problem was that my minIO did not actually accept http requests. My already-working spark-submit job was using https with SSL, so I added the needed parameters to $SPARK_DAEMON_JAVA_OPTS and it worked.
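A "Connection refused" like the one above usually means nothing is listening on the host and port the client tried, so it can be worth probing the endpoint directly before touching the Spark configuration. Below is a minimal sketch using only the Python standard library; the function names endpoint_address and accepts_connections are made up for this example, and it assumes the endpoint uses the http or https scheme. It only tests whether a TCP connection can be opened, not whether the server speaks plain HTTP or TLS on that port.

```python
import socket
from urllib.parse import urlsplit

# Default ports per scheme, used when the endpoint URL does not name a port.
DEFAULT_PORTS = {"http": 80, "https": 443}


def endpoint_address(endpoint):
    """Split an endpoint URL into (host, port). Raises KeyError for other schemes."""
    parts = urlsplit(endpoint)
    return parts.hostname, parts.port or DEFAULT_PORTS[parts.scheme]


def accepts_connections(endpoint, timeout=3.0):
    """Return True if a TCP connection to the endpoint's host and port succeeds."""
    host, port = endpoint_address(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Comparing the result for the http:// form of the endpoint with the https:// form would show whether the server is reachable on only one of the two ports.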

spark on yarn java.io.IOException: No FileSystem for scheme: s3n

My English is poor, sorry, but I really need help.
I use spark-2.0.0-bin-hadoop2.7 and Hadoop 2.7.3. I read logs from S3 and write the results to local HDFS. I can run the Spark driver in standalone mode successfully, but when I run the same driver in YARN mode, it throws:
17/02/10 16:20:16 ERROR ApplicationMaster: User class threw exception: java.io.IOException: No FileSystem for scheme: s3n
In hadoop-env.sh I added:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
Running hadoop fs -ls s3n://xxx/xxx/xxx lists the files.
I think the problem is that it can't find aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.3.jar.
What can I do?
I'm not using the same versions as you, but here is an extract of my [spark_path]/conf/spark-defaults.conf file that was necessary to get s3a working:
# hadoop s3 config
spark.driver.extraClassPath [path]/guava-16.0.1.jar:[path]/aws-java-sdk-1.7.4.jar:[path]/hadoop-aws-2.7.2.jar
spark.executor.extraClassPath [path]/guava-16.0.1.jar:[path]/aws-java-sdk-1.7.4.jar:[path]/hadoop-aws-2.7.2.jar
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key [key]
spark.hadoop.fs.s3a.secret.key [key]
spark.hadoop.fs.s3a.fast.upload true
Alternatively you can specify paths to the jars in a comma-separated format to the --jars option on job submit:
--jars [path]aws-java-sdk-[version].jar,[path]hadoop-aws-[version].jar
Notes:
Ensure the jars are in the same location on all nodes in your cluster
Replace [path] with your path
Replace s3a with your preferred protocol (last time I checked s3a was best)
I don't think guava is required to get s3a working but I can't remember
Stick the JARs into SPARK_HOME/lib, with the rest of the spark bits.
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem isn't needed; the JAR will be autoscanned and picked up.
Don't play with the fast upload option (fs.s3a.fast.upload) on 2.7.x unless you know what you are doing and are prepared to tune some of the thread pool options. Start without that option.
Add these jars to $SPARK_HOME/jars:
aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.3.jar,jackson-annotations-2.7.0.jar,jackson-core-2.7.0.jar,jackson-databind-2.7.0.jar,joda-time-2.9.6.jar

spark-submit classpath issue with --repositories --packages options

I'm running Spark in a standalone cluster where the Spark master, worker and submit each run in their own Docker container.
When I spark-submit my Java app with the --repositories and --packages options, I can see that it successfully downloads the app's required dependencies. However, the stderr logs on the Spark worker's web UI report a java.lang.ClassNotFoundException: kafka.serializer.StringDecoder. This class is available in one of the dependencies downloaded by spark-submit, but it doesn't look like it's available on the worker classpath.
16/02/22 16:17:09 INFO SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.NoClassDefFoundError: kafka/serializer/StringDecoder
at com.my.spark.app.JavaDirectKafkaWordCount.main(JavaDirectKafkaWordCount.java:71)
... 6 more
Caused by: java.lang.ClassNotFoundException: kafka.serializer.StringDecoder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 7 more
The spark-submit call:
${SPARK_HOME}/bin/spark-submit --deploy-mode cluster \
--master spark://spark-master:7077 \
--repositories https://oss.sonatype.org/content/groups/public/ \
--packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0,org.elasticsearch:elasticsearch-spark_2.10:2.2.0 \
--class com.my.spark.app.JavaDirectKafkaWordCount \
/app/spark-app.jar kafka-server:9092 mytopic
I was working with Spark 2.4.0 when I ran into this problem. I don't have a solution yet, just some observations based on experimentation and reading around for solutions. I am noting them down here in case they help someone in their investigation. I will update this answer if I find more information later.
The --repositories option is required only if some custom repository has to be referenced
By default the maven central repository is used if the --repositories option is not provided
When --packages option is specified, the submit operation tries to look for the packages and their dependencies in the ~/.ivy2/cache, ~/.ivy2/jars, ~/.m2/repository directories.
If they are not found, then they are downloaded from maven central using ivy and stored under the ~/.ivy2 directory.
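To check by hand whether a package from --packages is already in the local Maven repository, it helps to know where a coordinate lands on disk. The sketch below follows the standard Maven repository layout (group ID split on dots, then artifact, then version); the function names are hypothetical, and the paths Ivy uses under ~/.ivy2 are not covered here.

```python
from pathlib import PurePosixPath


def split_packages(packages_option):
    """Split a --packages value into its comma-separated coordinates."""
    return [item.strip() for item in packages_option.split(",") if item.strip()]


def maven_repo_path(coordinate, repo_root="~/.m2/repository"):
    """Return the jar path for a "group:artifact:version" coordinate.

    Uses the standard Maven layout; "~" is left unexpanded on purpose.
    """
    group, artifact, version = coordinate.split(":")
    jar_name = f"{artifact}-{version}.jar"
    return str(PurePosixPath(repo_root, *group.split("."), artifact, version, jar_name))


if __name__ == "__main__":
    option = ("org.apache.spark:spark-streaming-kafka_2.10:1.6.0,"
              "org.elasticsearch:elasticsearch-spark_2.10:2.2.0")
    for coordinate in split_packages(option):
        print(maven_repo_path(coordinate))
```

The coordinates in the demo are the ones from the spark-submit call in the question.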
In my case I had observed that
spark-shell worked perfectly with the --packages option
spark-submit would fail to do the same. It would download the dependencies correctly but fail to pass on the jars to the driver and worker nodes
spark-submit worked with the --packages option if I ran the driver locally using --deploy-mode client instead of cluster.
This would run the driver locally in the command shell where I ran the spark-submit command but the worker would run on the cluster with the appropriate dependency jars
I found the following discussion useful but I still have to nail down this problem.
https://github.com/databricks/spark-redshift/issues/244#issuecomment-347082455
Most people just use an uber jar to avoid running into this problem, and also to avoid conflicting jar versions where the platform provides a different version of the same dependency jar.
But I don't like that idea beyond a stopgap arrangement and am still looking for a solution.

Spark SQL Thrift Server on CDH 5.3.0

I am trying to use CDH 5.3.0 to run Spark's Thrift Server. I'm trying to follow the Spark SQL instructions, but I can't even get the --help option to run successfully. In the output below, it dies because it can't find the HiveServer2 class.
$ /usr/lib/spark/sbin/start-thriftserver.sh --help
Usage: ./sbin/start-thriftserver [options] [thrift server options]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of local jars to include on the driver
and executor classpaths.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
--conf PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--help, -h Show this help message and exit
--verbose, -v Print additional debug output
Spark standalone with cluster deploy mode only:
--driver-cores NUM Cores for driver (Default: 1).
--supervise If given, restarts the driver on failure.
Spark standalone and Mesos only:
--total-executor-cores NUM Total cores for all executors.
YARN-only:
--executor-cores NUM Number of cores per executor (Default: 1).
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
--num-executors NUM Number of executors to launch (Default: 2).
--archives ARCHIVES Comma separated list of archives to be extracted into the
working directory of each executor.
Thrift server options:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hive/service/server/HiveServer2
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
Caused by: java.lang.ClassNotFoundException: org.apache.hive.service.server.HiveServer2
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 13 more
As indicated by the error, the class is not in the classpath. Unfortunately, setting the CLASSPATH environment variable won't work. The only solution that I could find was to edit /usr/lib/spark/bin/compute-classpath.sh and add this line (it can go just about anywhere, but put it one line from the end to make it clear that it's an addition):
CLASSPATH="$CLASSPATH:/usr/lib/hive/lib/*"
Cloudera's release notes for 5.3.0 explicitly state "Spark SQL remains an experimental and unsupported feature in CDH", so it's not surprising that tweaks like this may be needed. Also, this response to a similar problem in CDH 5.2 suggests that the Hive jars are deliberately excluded by Cloudera for size reasons.
I have faced the same problem but I solved it in another way.
My Cloudera CDH version was not 5.3.0 but some earlier version, so you may find the paths a little different.
The solution was simply to replace the spark-assembly-**.jar file that shipped with Cloudera CDH with another version.
I downloaded Spark from its official download page. The version I downloaded was built for Hadoop 2.4 and later. I extracted the downloaded file and looked for spark-assembly-**.jar.
In the Cloudera installation, I looked for the same file and found it under the path /usr/lib/spark/lib/spark-assembly--.jar
That path was actually a symlink to the real file. I uploaded the jar from the Spark download to the same directory and made the symlink point to the new jar (ln -f -s target link).
Everything works fine for me.
/usr/lib/spark/bin/compute-classpath.sh sets CLASSPATH="$SPARK_CLASSPATH". On CDH using parcels you can add the hive jars to SPARK_CLASSPATH like this:
SPARK_CLASSPATH=$(ls -1 /opt/cloudera/parcels/CDH/lib/hive/lib/*.jar | sed -e :a -e 'N;s/\n/:/;ta') /opt/cloudera/parcels/CDH/lib/spark/sbin/start-thriftserver.sh --help
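The ls and sed pipeline above just joins every jar in the Hive lib directory with colons. For anyone who finds the sed loop hard to read, here is a rough Python equivalent as a sketch; the function name jars_classpath is made up, and the sort order may differ slightly from what ls produces under some locales.

```python
import glob
import os


def jars_classpath(directory):
    """Join every .jar file in directory into one colon-separated classpath string."""
    jars = sorted(glob.glob(os.path.join(directory, "*.jar")))
    return ":".join(jars)


if __name__ == "__main__":
    # Directory taken from the command above; prints an empty line if it does not exist.
    print(jars_classpath("/opt/cloudera/parcels/CDH/lib/hive/lib"))
```

The printed string can then be assigned to SPARK_CLASSPATH in the same way as the output of the shell pipeline.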
Instructions from Cloudera Community forum
http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/CDH-5-5-does-not-have-Spark-Thrift-Server/m-p/41849#M1758 :
git clone https://github.com/cloudera/spark.git
cd spark
./make-distribution.sh -DskipTests \
-Dhadoop.version=2.6.0-cdh5.7.0 \
-Phadoop-2.6 \
-Pyarn \
-Phive -Phive-thriftserver \
-Pflume-provided \
-Phadoop-provided \
-Phbase-provided \
-Phive-provided \
-Pparquet-provided
-Phive and -Phive-thriftserver are the key pieces there.
There is a request to add Spark Thrift Server
https://issues.cloudera.org/browse/DISTRO-817
Please vote it up if you want to see that in CDH.
