Reclaiming tables corrupted when HDFS volume was at 100% - apache-spark

I am using Hadoop version 2.7.0-mapr-1506.
When the data volume was at 100%, our jobs still tried to INSERT OVERWRITE data into a few Hive tables. Those tables are now corrupted and throw the exception below when accessed:
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: maprfs:/hive/bigdata.db/cadfp_wmt_table
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:289)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
Now that we have freed up space in the data volume, we want to reclaim the data in the tables below. How can we achieve this?
hadoop fs -ls /hive/bigdata.db/ | grep tmp~
drwxr-xr-x - bigdata bigdata 16 2019-04-05 07:38 /hive/bigdata.db/pc_bt_clean_table.tmp~!#
drwxr-xr-x - bigdata bigdata 209 2019-04-05 07:51 /hive/bigdata.db/pc_bt_table.tmp~!#
drwxr-xr-x - bigdata bigdata 1081 2019-04-05 07:38 /hive/bigdata.db/cadfp_wmt_table.tmp~!#
I tried the steps mentioned in "How to fix corrupt HDFS files", but the hdfs command does not work for me.

Related

The root scratch dir: /tmp/hive on HDFS should be writable Spark app error

I have created a Spark application that uses the Hive metastore, but on the line that creates the external Hive table I get the following error when I execute the application (Spark driver logs):
Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxrwxr-x;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxrwxr-x
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
I run the application using the Spark operator for K8s.
So I checked the permissions of the directory on the driver pod of the Spark application:
ls -l /tmp
...
drwxrwxr-x 1 1001 1001 4096 Feb 22 16:47 hive
If I try to change the permissions, it has no effect.
I run the Hive metastore and HDFS in K8s as well.
How can this problem be fixed?
This is a common error that can be fixed by creating a directory in another location and pointing Spark at the new directory.
Step 1: Create a new directory called tmpops at /tmp/tmpops
Step 2: Give the directory open permissions: chmod 777 /tmp/tmpops
Note: 777 is for local testing only. If you are working with sensitive data, make sure to restrict access to this path (e.g. through your security groups/permissions) to avoid accidental data leakage and security loopholes.
Step 3: Add the property below to the hive-site.xml that the Spark app refers to:
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/tmpops</value>
</property>
Once you do this, the error will no longer appear unless someone deletes that dir.
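If editing hive-site.xml is not convenient (for example, when the image is pre-built), a possible alternative, shown here only as a sketch and under the assumption that Spark forwards spark.hadoop.* properties to its Hive client, is to set the scratch directory from the application itself:
from pyspark.sql import SparkSession

# Point the Hive client at the directory created in Step 1 (/tmp/tmpops).
spark = (
    SparkSession.builder
    .appName("scratchdir-example")
    .config("spark.hadoop.hive.exec.scratchdir", "/tmp/tmpops")
    .enableHiveSupport()
    .getOrCreate()
)

# Any metastore access should now use /tmp/tmpops as the Hive scratch dir.
spark.sql("SHOW DATABASES").show()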
I faced the same issue on Windows 10; the following steps fixed it for me:
Open Command Prompt in admin mode
Run winutils.exe chmod 777 /tmp/hive
Open spark-shell --master local[2]

When does a Spark on YARN application exit with exitCode: -104?

My Spark application reads 3 files of 7 MB, 40 MB and 100 MB, performs many transformations, and stores the output in multiple directories.
Spark version CDH1.5
MASTER_URL=yarn-cluster
NUM_EXECUTORS=15
EXECUTOR_MEMORY=4G
EXECUTOR_CORES=6
DRIVER_MEMORY=3G
My Spark job ran for some time, then threw the error message below and restarted from the beginning:
18/03/27 18:59:44 INFO avro.AvroRelation: using snappy for Avro output
18/03/27 18:59:47 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
18/03/27 18:59:47 INFO CuratorFrameworkSingleton: Closing ZooKeeper client.
Once it restarted, it ran for some time again and then failed with this error:
Application application_1521733534016_7233 failed 2 times due to AM Container for appattempt_1521733534016_7233_000002 exited with exitCode: -104
For more detailed output, check application tracking page:http://entline.com:8088/proxy/application_1521733534016_7233/Then, click on links to logs of each attempt.
Diagnostics: Container [pid=52716,containerID=container_e98_1521733534016_7233_02_000001] is running beyond physical memory limits. Current usage: 3.5 GB of 3.5 GB physical memory used; 4.3 GB of 7.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_e98_1521733534016_7233_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 52720 52716 52716 52716 (java) 89736 8182 4495249408 923677 /usr/java/jdk1.7.0_67-cloudera/bin/java -server -Xmx3072m -Djava.io.tmpdir=/apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001 -XX:MaxPermSize=256m org.apache.spark.deploy.yarn.ApplicationMaster --class com.sky.ids.dovetail.asrun.etl.DovetailAsRunETLMain --jar file:/apps/projects/dovetail_asrun_etl/jars/EntLine-1.0-SNAPSHOT-jar-with-dependencies.jar --arg --app.conf.path --arg application.conf --arg --run_type --arg AUTO --arg --bus_date --arg 2018-03-27 --arg --code_base_id --arg EntLine-1.0-SNAPSHOT --executor-memory 4096m --executor-cores 6 --properties-file /apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/__spark_conf__/__spark_conf__.properties
|- 52716 52714 52716 52716 (bash) 2 0 108998656 389 /bin/bash -c LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/../../../CDH-5.5.1-1.cdh5.5.1.p0.11/lib/hadoop/lib/native: /usr/java/jdk1.7.0_67-cloudera/bin/java -server -Xmx3072m -Djava.io.tmpdir=/apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/tmp -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001 -XX:MaxPermSize=256m org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.sky.ids.dovetail.asrun.etl.DovetailAsRunETLMain' --jar file:/apps/projects/dovetail_asrun_etl/jars/EntLine-1.0-SNAPSHOT-jar-with-dependencies.jar --arg '--app.conf.path' --arg 'application.conf' --arg '--run_type' --arg 'AUTO' --arg '--bus_date' --arg '2018-03-27' --arg '--code_base_id' --arg 'EntLine-1.0-SNAPSHOT' --executor-memory 4096m --executor-cores 6 --properties-file /apps/hadoop/data04/yarn/nm/usercache/bdbuild/appcache/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/stdout 2> /var/log/hadoop-yarn/container/application_1521733534016_7233/container_e98_1521733534016_7233_02_000001/stderr
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
As per my CDH configuration:
Container Memory [Amount of physical memory, in MiB, that can be allocated for containers]
yarn.nodemanager.resource.memory-mb = 50655 MiB
Please see the containers running on my driver node.
Why are there so many containers running on one node?
I know that container_e98_1521733534016_7880_02_000001 is for my Spark application, but I don't know about the other containers. Any idea on that?
I also see that the physical memory for container_e98_1521733534016_7880_02_000001 is 3584 MB, which is close to 3.5 GB.
What does this error mean? When does it usually occur?
What is "3.5 GB of 3.5 GB physical memory"? Is it the driver memory?
Could someone help me with this issue?
container_e98_1521733534016_7233_02_000001 is the first container started, and given MASTER_URL=yarn-cluster it is not only the ApplicationMaster but also the driver of the Spark application.
It appears that the memory setting for the driver, i.e. DRIVER_MEMORY=3G, is too low and you have to bump it up.
Spark on YARN runs two executors by default (see --num-executors) and so you'll end up with 3 YARN containers with 000001 for the ApplicationMaster (perhaps with the driver) and 000002 and 000003 for the two executors.
What is 3.5 GB of 3.5 GB physical memory? Is it driver memory?
Since you use yarn-cluster, the driver, the ApplicationMaster and container_e98_1521733534016_7233_02_000001 are all the same thing running in the same JVM, which means the error is about how much memory you assigned to the driver.
My understanding is that you gave DRIVER_MEMORY=3G, which happened to be too little for your processing, and once YARN figured that out it killed the driver (and hence the entire Spark application, as it is not possible to have a Spark application up and running without the driver).
See the document Running Spark on YARN.
A small addition to what @Jacek already wrote, to answer the question
why do you get 3.5 GB instead of 3 GB?
Apart from DRIVER_MEMORY=3G you need to consider spark.driver.memoryOverhead, which defaults to MAX(DRIVER_MEMORY * 0.10, 384 MB) = 384 MB here, so the container needs about 3 GB + 384 MB ≈ 3.4 GB, which YARN rounds up to its allocation increment: the 3584 MB (3.5 GB) reported in the logs.
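A quick back-of-the-envelope check of that sizing (a sketch of the default overhead rule, not taken from the answer above):
# Driver/ApplicationMaster container size under the default rule max(driverMemory * 0.10, 384 MB).
driver_memory_mb = 3 * 1024                            # DRIVER_MEMORY=3G
overhead_mb = max(int(driver_memory_mb * 0.10), 384)   # 0.10 * 3072 = 307 MB < 384 MB, so 384 MB
requested_mb = driver_memory_mb + overhead_mb          # 3456 MB, roughly 3.4 GB
print(requested_mb)  # YARN rounds the request up to its allocation increment: 3584 MB, i.e. 3.5 GB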

SnappyData : java.lang.OutOfMemoryError: GC overhead limit exceeded

I have 1.2 GB of ORC data on S3 and I am trying to do the following with it:
1) Cache the data on a SnappyData cluster [SnappyData 0.9]
2) Execute a group-by query on the cached dataset
3) Compare the performance with Spark 2.0.0
I am using a 64 GB / 8 core machine, and the configuration for the SnappyData cluster is as follows:
$ cat locators
localhost
$cat leads
localhost -heap-size=4096m -spark.executor.cores=1
$cat servers
localhost -heap-size=6144m
localhost -heap-size=6144m
localhost -heap-size=6144m
localhost -heap-size=6144m
localhost -heap-size=6144m
localhost -heap-size=6144m
Now, I have written a small Python script to cache the ORC data from S3 and run a simple group-by query, as follows:
from pyspark.sql.snappy import SnappyContext
from pyspark import SparkContext,SparkConf
conf = SparkConf().setAppName('snappy_sample')
sc = SparkContext(conf=conf)
sqlContext = SnappyContext(sc)
sqlContext.sql("CREATE EXTERNAL TABLE if not exists my_schema.my_table using orc options(path 's3a://access_key:secret_key#bucket_name/path')")
sqlContext.cacheTable("my_schema.my_table")
out = sqlContext.sql("select * from my_schema.my_table where (WeekId = '1') order by sum_viewcount desc limit 25")
out.collect()
The above script is executed using the following command:
spark-submit --master local[*] snappy_sample.py
and I get the following error:
17/10/04 02:50:32 WARN memory.MemoryStore: Not enough space to cache rdd_2_5 in memory! (computed 21.2 MB so far)
17/10/04 02:50:32 WARN memory.MemoryStore: Not enough space to cache rdd_2_0 in memory! (computed 21.2 MB so far)
17/10/04 02:50:32 WARN storage.BlockManager: Persisting block rdd_2_5 to disk instead.
17/10/04 02:50:32 WARN storage.BlockManager: Persisting block rdd_2_0 to disk instead.
17/10/04 02:50:47 WARN storage.BlockManager: Putting block rdd_2_2 failed due to an exception
17/10/04 02:50:47 WARN storage.BlockManager: Block rdd_2_2 could not be removed as it was not found on disk or in memory
17/10/04 02:50:47 ERROR executor.Executor: Exception in task 2.0 in stage 0.0 (TID 2)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:96)
at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:97)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1$$anonfun$next$2.apply(InMemoryRelation.scala:135)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1$$anonfun$next$2.apply(InMemoryRelation.scala:134)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:134)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:98)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:232)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:331)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:320)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:284)
at org.apache.spark.sql.execution.WholeStageCodegenRDD.compute(WholeStageCodegenExec.scala:496)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:320)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:284)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:320)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:284)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
17/10/04 02:50:47 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-2,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.sql.execution.columnar.compression.CompressibleColumnBuilder$class.build(CompressibleColumnBuilder.scala:96)
at org.apache.spark.sql.execution.columnar.NativeColumnBuilder.build(ColumnBuilder.scala:97)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1$$anonfun$next$2.apply(InMemoryRelation.scala:135)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1$$anonfun$next$2.apply(InMemoryRelation.scala:134)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:134)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$1$$anon$1.next(InMemoryRelation.scala:98)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:232)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:331)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:320)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:284)
at org.apache.spark.sql.execution.WholeStageCodegenRDD.compute(WholeStageCodegenExec.scala:496)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:320)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:284)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:320)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:284)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
17/10/04 02:50:48 INFO snappystore: VM is exiting - shutting down distributed system
Apart from the above error, how do I check whether the data has been cached in the SnappyData cluster?
1) First, it does not look like you are connecting to the SnappyData cluster with the Python script; rather, you are running it in local mode. In that case the JVM launched by the Python script fails with OOM, as would be expected. When using Python, connect to the SnappyData cluster in "smart connector" mode:
spark-submit --master local[*] --conf snappydata.connection=locator:1527 snappy_sample.py
The host:port above is the locator host and the port on which its Thrift server is running (1527 by default).
2) Second, your example will just cache the data using Spark. If you want to use SnappyData, load it into a column table:
from pyspark.sql.snappy import SnappySession
from pyspark import SparkContext,SparkConf
conf = SparkConf().setAppName('snappy_sample')
sc = SparkContext(conf=conf)
session = SnappySession(sc)
session.sql("CREATE EXTERNAL TABLE if not exists my_table using orc options(path 's3a://access_key:secret_key#bucket_name/path')")
session.table("my_table").write.format("column").saveAsTable("my_column_table")
out = session.sql("select * from my_column_table where (WeekId = '1') order by sum_viewcount desc limit 25")
out.collect()
Also note the use of "SnappySession" rather than context which is deprecated since Spark 2.0.x. When comparing against Spark caching, you can use the "cacheTable" in a separate script and run against upstream Spark. Note that "cacheTable" will do the caching lazily meaning that first query will do the actual caching so first query run will be very slow with Spark caching but subsequent ones should be faster.
3) Update to the 1.0 release, which has many improvements over 0.9. You will also need to add hadoop-aws-2.7.3 and aws-java-sdk-1.7.4 to the "-classpath" in conf/leads and conf/servers (or put them into the jars directory of the product) before launching the cluster.
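For the Spark-caching comparison mentioned in point 2, a minimal standalone script run against upstream Spark might look like the sketch below (the bucket path is a placeholder; the column names are taken from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark_cache_compare").getOrCreate()

# Register the ORC data as a temporary view and cache it.
spark.read.orc("s3a://bucket_name/path").createOrReplaceTempView("my_table")
spark.catalog.cacheTable("my_table")

# cacheTable is lazy: the first query materializes the cache and will be slow;
# re-run the query to measure the cached performance.
out = spark.sql("select * from my_table where WeekId = '1' order by sum_viewcount desc limit 25")
out.collect()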

./spark-shell doesn't start correctly (spark1.6.1-bin.hadoop2.6 version)

I installed this spark version: spark-1.6.1-bin-hadoop2.6.tgz.
Now when I start Spark with the ./spark-shell command I get these issues (it shows a lot of error lines, so I have only included some that seem important):
Cleanup action completed
16/03/27 00:19:35 ERROR Schema: Failed initialising database.
Failed to create database 'metastore_db', see the next exception for details.
org.datanucleus.exceptions.NucleusDataStoreException: Failed to create database 'metastore_db', see the next exception for details.
at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
Caused by: java.sql.SQLException: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
... 128 more
Caused by: ERROR XBM0H: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
Nested Throwables StackTrace:
java.sql.SQLException: Failed to create database 'metastore_db', see the next exception for details.
org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
... 128 more
Caused by: ERROR XBM0H: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
at org.apache.derby.iapi.error.StandardException.newException
Caused by: java.sql.SQLException: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
at
... 128 more
<console>:16: error: not found: value sqlContext
import sqlContext.implicits._
^
<console>:16: error: not found: value sqlContext
import sqlContext.sql
^
scala>
I tried some configurations to fix this issue that I found in other questions about the "value sqlContext not found" issue, such as:
/etc/hosts file:
127.0.0.1 hadoophost localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.2.0.15 hadoophost
echo $HOSTNAME returns:
hadoophost
.bashrc file contains:
export SPARK_LOCAL_IP=127.0.0.1
But it doesn't work. Can you give me some help to understand why Spark is not starting correctly?
hive-default.xml.template
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?><!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
--><configuration>
<!-- WARNING!!! This file is auto generated for documentation purposes ONLY! -->
<!-- WARNING!!! Any changes you make to this file will be ignored by Hive. -->
<!-- WARNING!!! You must make your changes in hive-site.xml instead. -->
In the home folder I get the same issues:
[hadoopadmin@hadoop home]$ pwd
/home
[hadoopadmin@hadoop home]$
Folder permissions:
[hadoopdadmin@hadoop spark-1.6.1-bin-hadoop2.6]$ ls -la
total 1416
drwxr-xr-x. 12 hadoop hadoop 4096 .
drwxr-xr-x. 16 root root 4096 ..
drwxr-xr-x. 2 hadoop hadoop 4096 bin
-rw-r--r--. 1 hadoop hadoop 1343562 CHANGES.txt
drwxr-xr-x. 2 hadoop hadoop 4096 conf
drwxr-xr-x. 3 hadoop hadoop 4096 data
drwxr-xr-x. 3 hadoop hadoop 4096 ec2
drwxr-xr-x. 3 hadoop hadoop 4096 examples
drwxr-xr-x. 2 hadoop hadoop 4096 lib
-rw-r--r--. 1 hadoop hadoop 17352 LICENSE
drwxr-xr-x. 2 hadoop hadoop 4096 licenses
-rw-r--r--. 1 hadoop hadoop 23529 NOTICE
drwxr-xr-x. 6 hadoop hadoop 4096 python
drwxr-xr-x. 3 hadoop hadoop 4096 R
-rw-r--r--. 1 hadoop hadoop 3359 README.md
-rw-r--r--. 1 hadoop hadoop 120 RELEASE
drwxr-xr-x. 2 hadoop hadoop 4096 sbin
Apparently you don't have permission to write in that directory. I recommend running ./spark-shell from your HOME directory (you might want to add that command to your PATH), or from any other directory that is accessible and writable by your user.
This might also be relevant for you: Notebooks together with Spark
You are using Spark built with Hive support.
There are two possible solutions, depending on what you want to do later with your spark-shell or in your Spark jobs:
You want to access Hive tables in your Hadoop+Hive installation.
You should place hive-site.xml in your Spark installation's conf sub-directory. Take hive-site.xml from your existing Hive installation; for example, in my Cloudera VM, hive-site.xml is at /usr/lib/hive/conf. Launching spark-shell after this step should successfully connect to the existing Hive metastore, and it will not try to create a temporary metastore database in your current working directory (see the verification sketch below).
You do NOT want to access Hive tables in your Hadoop+Hive installation.
If you do not care about connecting to Hive tables, then you can follow Alberto's solution: fix the permission issues in the directory from which you are launching spark-shell, and make sure you are allowed to create directories/files in that directory.
Hope this helps.
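As a quick way to confirm that the first option worked (a hypothetical check, not part of the original answer), ask Spark for the list of Hive databases; if hive-site.xml has been picked up, this hits the existing metastore instead of creating a local Derby metastore_db. Since the question is about Spark 1.6, the sketch uses HiveContext:
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="metastore-check")
sqlContext = HiveContext(sc)

# Should list the databases known to your Hive installation,
# not just "default" from a freshly created local metastore_db.
for row in sqlContext.sql("SHOW DATABASES").collect():
    print(row)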

About saving a model file in spark

I'm testing the Linear Support Vector Machines (SVMs) code from the link below:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machines-svms
I tested the code with spark-shell --master spark://192.168.0.181:7077.
I modified the last 2 lines like this:
model.save(sc, "file:///Apps/spark/data/mllib/testModelPath")
val sameModel = SVMModel.load(sc, "file:///Apps/spark/data/mllib/testModelPath")
model.save finished with no error, but when I tried to load that model it gave the following message:
INFO mapred.FileInputFormat: Total input paths to process : 0
java.lang.UnsupportedOperationException: empty collection
:
:
When I tested without file:///, the model was saved to HDFS and I could load it without error.
hadoop#XXX /Apps/spark/data/mllib/testModelPath/data> ll
drwxrwxr-x 2 hadoop hadoop 4096 2015-10-07 16:47 ./
drwxrwxr-x 4 hadoop hadoop 4096 2015-10-07 16:47 ../
-rw-rw-r-- 1 hadoop hadoop 8 2015-10-07 16:47 ._SUCCESS.crc
-rw-rw-r-- 1 hadoop hadoop 16 2015-10-07 16:47 ._common_metadata.crc
-rw-r--r-- 1 hadoop hadoop 0 2015-10-07 16:47 _SUCCESS
-rw-r--r-- 1 hadoop hadoop 932 2015-10-07 16:47 _common_metadata
When I checked the folder after saving the model, I found that the _metadata file was not created.
Does anyone know the reason for this?
I met the same problem. It is caused by the save and load functions. If you run Spark on multiple nodes, then when you save the model to a local path the save API only writes each node's own partitions to that node's local filesystem, so the model on every node is incomplete. But the load function's source code uses the textFile API to load the model, and textFile requires the file to be the same on every node. That is what causes the problem. The way I solved it was to save the model to HDFS, even though that uses extra space.
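A minimal PySpark sketch of that workaround (paths are placeholders, and PySpark is used here instead of the Scala shell from the question): save the model to a filesystem every node can see, such as HDFS, so the load step finds all of the part files.
from pyspark import SparkContext
from pyspark.mllib.classification import SVMWithSGD, SVMModel
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="svm-save-load")
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
model = SVMWithSGD.train(data, iterations=100)

# All part files land in one shared HDFS directory instead of being scattered
# across the workers' local disks, so SVMModel.load reads the complete model.
model.save(sc, "hdfs:///user/hadoop/testModelPath")
same_model = SVMModel.load(sc, "hdfs:///user/hadoop/testModelPath")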
