Tachyon: Failed to rename during copyFromLocal command - apache-spark

I'm using Apache Spark to build an application. To make the RDDs available from other applications I'm trying two approaches:
Using tachyon
Using a spark-jobserver
I'm new to Tachyon. I completed the following tasks given in the a Running Tachyon on a Cluster
I'm able to access the UI from master:19999 URL.
From the tachyon directory I successfully created a directory./bin/tachyon tfs mkdir /Test
But while trying to do the copyFromLocal command I'm getting the following errors:
FailedToCheckpointException(message:Failed to rename hdfs://master:54310/tmp/tachyon/workers/1421840000001/8/93 to hdfs://master:54310/tmp/tachyon/data/93)

You are most likely running tachyon and spark-jobserver under different users, and have HDFS as your underFS.
Check out https://tachyon.atlassian.net/browse/TACHYON-1339 and the related patch.
The easy way out is running tachyon and your spark job server as the same user.
The (slightly) harder way is to port the patch and recompile spark, and then sjs with the patched client.

Related

Using different version of hadoop client library with apache spark

I'm trying to run two or more jobs in parallel. All jobs write append data using same output path, problem is that first job that finishes does cleanup and erases _temporary folder which causes other jobs to throw exception.
With hadoop-client 3 there is a configuration flag to disable auto cleanup of this folder mapreduce.fileoutputcommitter.cleanup.skipped.
I was able to exclude dependencies from spark-core and add new hadoop-client using maven. This run fine for master=local but I'm not convinced it is correct.
My questions are
Is it possible to use different hadoop-client library with apache spark (e.g. hadoop-client version 3 with apache spark 2.3) and what is the correct approach?
Is there better way to run multiple jobs in parallel writing under same path?

What setup is needed to use the Spark Cassandra Connector with Spark Job Server

I am working with Spark and Cassandra and in general things are straight forward and working as intended; in particular the spark-shell and running .scala processes to get results.
I'm now looking at utilisation of the Spark Job Server; I have the Job Server up and running and working as expected for both the test items, as well as some initial, simple .scala developed.
However I now want to take one of the .scala programs that works in spark-shell and get it onto the Spark Job Server to access via that mechanism. The issue I have is that the Job Server doesn't seem to recognise the import statements around cassandra and fails to build (sbt compile; sbt package) a jar for upload to the Job Server.
At some level it just looks like I need the Job Server equivalent to the spark shell package switch (--packages datastax:spark-cassandra-connector:2.0.1-s_2.11) on the Spark Job Server so that import com.datastax.spark.connector._ and similar code in the .scala files will work.
Currently when I attempt to build (sbt complie) I get message such as:
[error] /home/SparkCassandraTest.scala:10: object datastax is not a member of package com
[error] import com.datastax.spark.connector._
I have added different items to the build.sbt file based on searches and message board advice; but no real change; if that is the answer I'm after what should be added to the base Job Server to enable that usage of the cassandra connector.
I think that you need spark-submit to do this. I am working with Spark and Cassandra also, but only since one month; so I've needed read a lot of information. I had compiled this info in a repository, maybe this could help you, however is an alpha version, sorry about that.

Zeppelin - Unable to instantiate SessionHiveMetaStoreClient

I am trying to get Zeppelin to work. But when I run a notebook twice, the second time it fails due to Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient. (full log at the end of the post)
It seems to be due to the fact that the lock in the metastore doesn't get removed. It is also advised to use for example Postgres instead of Hive as it allows multiple users to run jobs in Zeppelin.
I made a postgres DB and a hive-site.xml pointing to this DB. I added this file into the config folder of Zeppelin but also into the config folder of Spark. Also in the jdbc interpreter of Zeppelin I added similar parameters than the ones in the hive-site.xml.
The problems persists though.
Error log: http://pastebin.com/Jqf9cdtU
hive-site.xml: http://pastebin.com/RZdXHPX4
Try using Thrift server architecture in the Spark setup instead of working on a single instance JVM of Hive where you cannot generate multiple of sessions.
There are mainly three types of connection to Hive:
Single JVM - Metastore stored locally in the warehouse which doesn't allow multiple sessions
Mutiple JVM - where each worker behaves as a metastore
Thrift Server Architecture - Multiple Users can access the SQL engine and parallelism can be achieved
Another instance of Derby may have already booted the database
By default, spark use derby as the metadata store which can only serve one user. It seems you start multiple spark interpreter, that's why you see the above error message. So here's the 2 solutions for you
Disable hive in spark interpreter via setting zeppelin.spark.useHiveContext to false if you don't need hive.
Set up hive metadata store which support multiple users. Refer this https://www.cloudera.com/documentation/enterprise/5-8-x/topics/cdh_ig_hive_metastore_configure.html
Stop Zeppelin. Go to your bin folder in Apache Zeppelin and try deleting metastore_db
sudo rm -r metastore_db/
Start Zeppelin again and try now.

Does EMR still have any advantages over EC2 for Spark?

I know this question has been asked before but those answers seem to revolve around Hadoop. For Spark you don't really need all the extra Hadoop cruft. With the spark-ec2 script (available via GitHub for 2.0) your environment is prepared for Spark. Are there any compelling use cases (other than a far superior boto3 sdk interface) for running with EMR over EC2?
This question boils down to the value of managed services, IMHO.
Running Spark as a standalone in local mode only requires you get the latest Spark, untar it, cd to its bin path and then running spark-submit, etc
However, creating a multi-node cluster that runs in cluster mode requires that you actually do real networking, configuring, tuning, etc. This means you've got to deal with IAM roles, Security groups, and there are subnet considerations within your VPC.
When you use EMR, you get a turnkey cluster in which you can 1-click install many popular applications (spark included), and all of the Security Groups are already configured properly for network communication between nodes, you've got logging already setup and pointing at S3, you've got easy SSH instructions, you've got an already-installed apparatus for tunneling and viewing the various UI's, you've got visual usage metrics at the IO level, node level, and job submission level, you also have the ability to create and run Steps -- which are jobs that can be run in the command line of the drive node or as Spark applications that leverage the whole cluster. Then, on top of that, you can export that whole cluster, steps included, and copy paste the CLI script into a recurring job via DataPipeline and literally create an ETL pipeline in 60 seconds flat.
You wouldn't get any of that if you built it yourself in EC2. I know which one I would choose... EMR. But that's just me.

How to modify spark source code and build

I just start learning spark. I have imported spark source code to IDEA and made some small changes (just add some println()) to spark source code. What should I do to see these updates? Should I recompile the spark? Thanks!
At the bare minimum, you will need maven 3.3.3 and Java 7+.
You can follow the steps at http://spark.apache.org/docs/latest/building-spark.html
The "make-distribution.sh" script is quite handy which comes within the spark source code root directory. This script will produce a distributable tar.gz which you can simply extract and launch spark-shell or spark-submit. After making the source code changes in spark, you can run this script with the right options (mainly passing the desired hadoop version, yarn or hive support options but these are required if you want to run on top of hadoop distro, or want to connect to existing hive).
BTW, inserting println() will not be a good idea as it can severely slow down the performance of the job. You should use a logger instead.

Resources