./spark-shell doesn't start correctly (spark-1.6.1-bin-hadoop2.6 version) - apache-spark

I installed this spark version: spark-1.6.1-bin-hadoop2.6.tgz.
Now when I start Spark with the ./spark-shell command, I get these errors (it prints a lot of error lines, so I've only included the ones that seem important):
Cleanup action completed
16/03/27 00:19:35 ERROR Schema: Failed initialising database.
Failed to create database 'metastore_db', see the next exception for details.
org.datanucleus.exceptions.NucleusDataStoreException: Failed to create database 'metastore_db', see the next exception for details.
at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
Caused by: java.sql.SQLException: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
... 128 more
Caused by: ERROR XBM0H: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
Nested Throwables StackTrace:
java.sql.SQLException: Failed to create database 'metastore_db', see the next exception for details.
org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
... 128 more
Caused by: ERROR XBM0H: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
at org.apache.derby.iapi.error.StandardException.newException
Caused by: java.sql.SQLException: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
at
... 128 more
<console>:16: error: not found: value sqlContext
import sqlContext.implicits._
^
<console>:16: error: not found: value sqlContext
import sqlContext.sql
^
scala>
I tried some of the configurations suggested in other questions about the "value sqlContext not found" issue, such as:
/etc/hosts file:
127.0.0.1 hadoophost localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.2.0.15 hadoophost
echo $HOSTNAME returns:
hadoophost
.bashrc file contains:
export SPARK_LOCAL_IP=127.0.0.1
But it doesn't work. Can you give me some help to understand why Spark is not starting correctly?
hive-default.xml.template
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--><configuration>
<!-- WARNING!!! This file is auto generated for documentation purposes ONLY! -->
<!-- WARNING!!! Any changes you make to this file will be ignored by Hive. -->
<!-- WARNING!!! You must make your changes in hive-site.xml instead. -->
From the home folder I get the same errors:
[hadoopadmin@hadoop home]$ pwd
/home
[hadoopadmin@hadoop home]$
Folder permissions:
[hadoopadmin@hadoop spark-1.6.1-bin-hadoop2.6]$ ls -la
total 1416
drwxr-xr-x. 12 hadoop hadoop 4096 .
drwxr-xr-x. 16 root root 4096 ..
drwxr-xr-x. 2 hadoop hadoop 4096 bin
-rw-r--r--. 1 hadoop hadoop 1343562 CHANGES.txt
drwxr-xr-x. 2 hadoop hadoop 4096 conf
drwxr-xr-x. 3 hadoop hadoop 4096 data
drwxr-xr-x. 3 hadoop hadoop 4096 ec2
drwxr-xr-x. 3 hadoop hadoop 4096 examples
drwxr-xr-x. 2 hadoop hadoop 4096 lib
-rw-r--r--. 1 hadoop hadoop 17352 LICENSE
drwxr-xr-x. 2 hadoop hadoop 4096 licenses
-rw-r--r--. 1 hadoop hadoop 23529 NOTICE
drwxr-xr-x. 6 hadoop hadoop 4096 python
drwxr-xr-x. 3 hadoop hadoop 4096 R
-rw-r--r--. 1 hadoop hadoop 3359 README.md
-rw-r--r--. 1 hadoop hadoop 120 RELEASE
drwxr-xr-x. 2 hadoop hadoop 4096 sbin

Apparently you don't have permission to write to that directory. I recommend running ./spark-shell from your HOME directory (you might want to add that command to your PATH), or from any other directory that is accessible and writable by your user.
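For example, a minimal sketch, assuming Spark was unpacked to /usr/local/spark-1.6.1-bin-hadoop2.6 (the path shown in the error messages above):
cd ~    # or any other directory your user can write to
/usr/local/spark-1.6.1-bin-hadoop2.6/bin/spark-shell
# Derby can now create metastore_db in the writable working directory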
This might also be relevant for you: Notebooks together with Spark

You are using Spark built with Hive support.
There are two possible solutions, based on what you want to do later with your spark-shell or in your Spark jobs:
You want to access Hive tables in your Hadoop+Hive installation.
You should place hive-site.xml in your Spark installation's conf sub-directory. Take hive-site.xml from your existing Hive installation; for example, on my Cloudera VM the hive-site.xml is at /usr/lib/hive/conf. Launching spark-shell after doing this step should successfully connect to the existing Hive metastore, and it will not try to create a temporary metastore_db database in your current working directory.
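For example, a minimal sketch, assuming a Cloudera-style Hive layout and the Spark install path from the question (adjust both paths to your installation):
cp /usr/lib/hive/conf/hive-site.xml /usr/local/spark-1.6.1-bin-hadoop2.6/conf/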
You do NOT want to access Hive tables in your Hadoop+Hive installation.
If you do not care about connecting to Hive tables, then you can follow Alberto's solution: fix the permission issues in the directory from which you are launching spark-shell, and make sure you are allowed to create directories and files in that directory.
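For example, a minimal sketch, assuming your user is hadoopadmin (as in the shell prompts above) and that the install directory is owned by hadoop:hadoop (as in the ls -la output):
sudo chown -R hadoopadmin:hadoopadmin /usr/local/spark-1.6.1-bin-hadoop2.6
# or simply launch spark-shell from a directory you already own, such as your HOME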
Hope this helps.

Related

Reclaiming tables corrupted when hdfs volume was at 100%

I am using Hadoop version 2.7.0-mapr-1506.
When the data volume was at 100%, our jobs still tried to insert-overwrite data into a few Hive tables; those tables are now corrupted and give the exception below when accessed:
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: maprfs:/hive/bigdata.db/cadfp_wmt_table
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:289)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
Now we have freed up space in the data volume and want to reclaim the data in the tables below. How can we achieve this?
hadoop fs -ls /hive/bigdata.db/ | grep tmp~
drwxr-xr-x - bigdata bigdata 16 2019-04-05 07:38 /hive/bigdata.db/pc_bt_clean_table.tmp~!#
drwxr-xr-x - bigdata bigdata 209 2019-04-05 07:51 /hive/bigdata.db/pc_bt_table.tmp~!#
drwxr-xr-x - bigdata bigdata 1081 2019-04-05 07:38 /hive/bigdata.db/cadfp_wmt_table.tmp~!#
I tried the steps mentioned here: How to fix corrupt HDFS files, but the hdfs command does not work for me.

The root scratch dir: /tmp/hive on HDFS should be writable Spark app error

I have created a Spark application which uses the Hive metastore, but at the line that creates an external Hive table I get the following error when I execute the application (Spark driver logs):
Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxrwxr-x;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxrwxr-x
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
I run the application using the Spark operator for K8s.
So I checked the permissions of the directories on the driver pod of the Spark application:
ls -l /tmp
...
drwxrwxr-x 1 1001 1001 4096 Feb 22 16:47 hive
If I try to change the permissions, it has no effect.
I run the Hive metastore and HDFS in K8s as well.
How can this problem be fixed?
This is a common error which can be fixed by creating a directory somewhere else and pointing Spark at the new directory.
Step 1: Create a new directory called tmpops at /tmp/tmpops
Step 2: Give permissions on the directory: chmod 777 /tmp/tmpops
Note: 777 is for local testing. If you are working with sensitive data, make sure to secure this path to avoid accidental data leakage and security loopholes.
Step 3: Add the property below to the hive-site.xml that the Spark app refers to:
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/tmpops</value>
</property>
Once you do this, the error will no longer appear unless someone deletes that directory.
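For reference, a minimal sketch of steps 1 and 2 combined (run wherever the scratch directory lives, e.g. on the driver pod):
mkdir -p /tmp/tmpops
chmod 777 /tmp/tmpops    # 777 is for local testing only; tighten it for sensitive data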
I faced the same issue on Windows 10; the solution below helped me get it fixed.
The following steps solved my problem:
Open a Command Prompt in admin mode
winutils.exe chmod 777 /tmp/hive
Open spark-shell --master local[2]

new Spark StreamingContext fails with HDFS errors

I'm using DC/OS installed via Azure ACS, and I installed HDFS and Spark via the dcos tool with default options.
Creating a Spark StreamingContext gives:
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn1. Check your hdfs-site.xml file to ensure namenodes are configured properly.
16/07/22 01:51:04 WARN DFSUtil: Namenode for hdfs remains unresolved for ID nn2. Check your hdfs-site.xml file to ensure namenodes are configured properly.
Exception in thread "main" java.lang.IllegalArgumentException:
java.net.UnknownHostException: namenode1.hdfs.mesos
I expect I have to redeploy the Spark package with dcos package install --options=, but I can't figure out what the hdfs.config-url should be. The https://docs.mesosphere.com/1.7/usage/service-guides/spark/install/#hdfs docs seem out of date.
Yes, it is out of date. We'll fix that.
DC/OS HDFS now serves its config on http://hdfs.marathon.mesos:[port]/v1/connect
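A hedged sketch of what the redeploy might look like, assuming the Spark package accepts the hdfs.config-url option mentioned in the question (the exact key layout of options.json is an assumption; [port] is kept as a placeholder):
cat > options.json <<'EOF'
{
  "hdfs": {
    "config-url": "http://hdfs.marathon.mesos:[port]/v1/connect"
  }
}
EOF
dcos package install spark --options=options.json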

SAP HANA Vora shell can't find org.apache.spark.launcher.Main

I'm currently taking my first steps with SAP HANA Vora on Cloudera Express 5.5.0.
The Vora server is up and running, and I would now like to use the Vora Spark shell, but this is what I get:
sh start-spark-shell.sh
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/launcher/Main
Caused by: java.lang.ClassNotFoundException: org.apache.spark.launcher.Main
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.spark.launcher.Main. Program will exit.
This is what my environment looks like:
export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
export JAVA_HOME=/usr/java/default
export HADOOP_PARCEL_PATH=/opt/cloudera/parcels/CDH
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/usr/lib/spark
export SPARK_CONF_DIR=$SPARK_HOME/conf
export PATH=$PATH:$SPARK_HOME/bin
SPARK_DIST_CLASSPATH=$SPARK_HOME/lib/spark-assembly.jar
SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$HADOOP_PARCEL_PATH/lib/hadoop/lib/*
SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$HADOOP_PARCEL_PATH/lib/hadoop/*
SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$HADOOP_PARCEL_PATH/lib/hadoop-hdfs/lib/*
SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$HADOOP_PARCEL_PATH/lib/hadoop-hdfs/*
SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$HADOOP_PARCEL_PATH/lib/hadoop-mapreduce/lib/*
SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$HADOOP_PARCEL_PATH/lib/hadoop-mapreduce/*
SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$HADOOP_PARCEL_PATH/lib/hadoop-yarn/lib/*
SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$HADOOP_PARCEL_PATH/lib/hadoop-yarn/*
SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$HADOOP_PARCEL_PATH/lib/hive/lib/*
SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$HADOOP_PARCEL_PATH/lib/flume-ng/lib/*
SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$HADOOP_PARCEL_PATH/lib/parquet/lib/*
SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$HADOOP_PARCEL_PATH/lib/avro/lib/*
export SPARK_DIST_CLASSPATH
Solved it.
I just needed to upgrade Java from JDK 6 to JDK 7. Make sure you have the following environment variables set (check the values against your installation):
export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
export JAVA_HOME=/usr/java/default
export HADOOP_PARCEL_PATH=/opt/cloudera/parcels/CDH
export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_HOME=/usr/lib/spark
export SPARK_CONF_DIR=$SPARK_HOME/conf
export PATH=$PATH:$SPARK_HOME/bin
Thanks for the answer, https://stackoverflow.com/users/1867854/michael-kunzmann.
I found JDK 7 already installed under the /usr/java/jdk1.7.0_67-cloudera directory; I think it's a step in the Cloudera Manager installation. The minimum supported version is 1.7.0_55 for CDH 5.3, 5.5 and 5.6 at the time of writing (e.g. see https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_req_supported_versions.html#concept_pdd_kzf_vp).
I was trying the standard (non-Vora) spark-shell and got the same issue on CDH, so JDK 7 is required for the standard spark-shell too. The Vora spark-shell script...
$VORA_SPARK_HOME/bin/start-spark-shell.sh
...just adds the Vora datasources jar as an additional library.
FYI, here's an example for the standard spark-shell on Cloudera CDH:
~> cd /usr/java
/usr/java> ls -l
total 8
lrwxrwxrwx 1 root root 16 Dec 17 2015 default -> /usr/java/latest
drwxr-xr-x 9 root root 4096 Dec 17 2015 jdk1.6.0_31
drwxr-xr-x 8 root root 4096 Dec 17 2015 jdk1.7.0_67-cloudera
lrwxrwxrwx 1 root root 21 Dec 17 2015 latest -> /usr/java/jdk1.6.0_31
/usr/java> export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
/usr/java> export SPARK_HOME=/usr/lib/spark
/usr/java> cd $SPARK_HOME
/usr/lib/spark> ./bin/spark-shell
FYI I also have Vora on Hortonworks. Java 7 was already on the PATH via the /usr/bin/java symbolic link, and this just worked:
source /etc/vora/vora-env.sh
$VORA_SPARK_HOME/bin/start-spark-shell.sh

About saving a model file in spark

I'm testing the Linear Support Vector Machines (SVMs) code at the link below:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machines-svms
I tested the code with spark-shell --master spark://192.168.0.181:7077.
I modified the last two lines like this:
model.save(sc, "file:///Apps/spark/data/mllib/testModelPath")
val sameModel = SVMModel.load(sc, "file:///Apps/spark/data/mllib/testModelPath")
model.save ended with no error, but when I tried to load that model it gave the following message:
INFO mapred.FileInputFormat: Total input paths to process : 0
java.lang.UnsupportedOperationException: empty collection
:
:
When I tested without file:///, the model was saved to HDFS, and I can load that model without error.
hadoop#XXX /Apps/spark/data/mllib/testModelPath/data> ll
drwxrwxr-x 2 hadoop hadoop 4096 2015-10-07 16:47 ./
drwxrwxr-x 4 hadoop hadoop 4096 2015-10-07 16:47 ../
-rw-rw-r-- 1 hadoop hadoop 8 2015-10-07 16:47 ._SUCCESS.crc
-rw-rw-r-- 1 hadoop hadoop 16 2015-10-07 16:47 ._common_metadata.crc
-rw-r--r-- 1 hadoop hadoop 0 2015-10-07 16:47 _SUCCESS
-rw-r--r-- 1 hadoop hadoop 932 2015-10-07 16:47 _common_metadata
When I checked the folder after saving the model, I found that the _metadata file was not created.
Does anyone know the reason for this?
I met the same problem. It is caused by the save and load functions. If you run Spark on multiple nodes, then when you save the model, the save API just writes each partition to the local filesystem of the node that holds it, so the model directory on every node is incomplete. But the load function's source code uses the textFile API, and textFile requires the file to be the same on every node. That is what causes the problem. I solved it by saving the model to HDFS, although that wastes space.
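For example, a minimal sketch using the question's own code, switched from the node-local file:// path to an HDFS path so that every node sees the same files:
import org.apache.spark.mllib.classification.SVMModel

// save to HDFS (visible to all nodes) instead of file:///...
model.save(sc, "hdfs:///Apps/spark/data/mllib/testModelPath")
val sameModel = SVMModel.load(sc, "hdfs:///Apps/spark/data/mllib/testModelPath")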
