About saving a model file in spark - apache-spark

I'm testing the Linear Support Vector Machines (SVMs) example code from the link below:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machines-svms
I tested the code with spark-shell --master spark://192.168.0.181:7077.
I modified the last two lines like this:
model.save(sc, "file:///Apps/spark/data/mllib/testModelPath")
val sameModel = SVMModel.load(sc, "file:///Apps/spark/data/mllib/testModelPath")
model.save completed without error, but when I tried to load the model I got the following message:
INFO mapred.FileInputFormat: Total input paths to process : 0
java.lang.UnsupportedOperationException: empty collection
:
:
When I tested without file:///, the model was saved to HDFS, and I could load it without error.
hadoop@XXX /Apps/spark/data/mllib/testModelPath/data> ll
drwxrwxr-x 2 hadoop hadoop 4096 2015-10-07 16:47 ./
drwxrwxr-x 4 hadoop hadoop 4096 2015-10-07 16:47 ../
-rw-rw-r-- 1 hadoop hadoop 8 2015-10-07 16:47 ._SUCCESS.crc
-rw-rw-r-- 1 hadoop hadoop 16 2015-10-07 16:47 ._common_metadata.crc
-rw-r--r-- 1 hadoop hadoop 0 2015-10-07 16:47 _SUCCESS
-rw-r--r-- 1 hadoop hadoop 932 2015-10-07 16:47 _common_metadata
When I checked the folder after saving the model, I found that the _metadata file was not created.
Does anyone know the reason for this?

I hit the same problem. It is caused by the save and load functions. If you run Spark on multiple nodes and save the model to a local path, the save API writes only the partitions each node holds to that node's local filesystem, so the model files on every node are incomplete. The load function, however, uses the textFile API, which expects the same complete files to be present on every node, and that mismatch causes the error. I worked around it by saving the model to HDFS, even though that uses extra space.
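For reference, a minimal sketch of that HDFS-based workaround in the same spark-shell session; the namenode host, port, and path are placeholders for whatever your cluster uses:
import org.apache.spark.mllib.classification.SVMModel

// Save to a shared HDFS location that every node can read and write
// (replace namenode:9000 and the path with your own HDFS address).
model.save(sc, "hdfs://namenode:9000/models/testModelPath")
val sameModel = SVMModel.load(sc, "hdfs://namenode:9000/models/testModelPath")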

Related

Reclaiming tables corrupted when hdfs volume was at 100%

I am using Hadoop version 2.7.0-mapr-1506.
When the data volume was at 100%, our jobs still tried to insert overwrite data into a few Hive tables; those tables are now corrupted and give the exception below when accessed:
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: maprfs:/hive/bigdata.db/cadfp_wmt_table
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:289)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
Now we have freed up space in the data volume and want to reclaim the data in the tables below. How can we achieve that?
hadoop fs -ls /hive/bigdata.db/ | grep tmp~
drwxr-xr-x - bigdata bigdata 16 2019-04-05 07:38 /hive/bigdata.db/pc_bt_clean_table.tmp~!#
drwxr-xr-x - bigdata bigdata 209 2019-04-05 07:51 /hive/bigdata.db/pc_bt_table.tmp~!#
drwxr-xr-x - bigdata bigdata 1081 2019-04-05 07:38 /hive/bigdata.db/cadfp_wmt_table.tmp~!#
I tried the steps mentioned here: How to fix corrupt HDFS Files, but the hdfs command does not work for me.

The root scratch dir: /tmp/hive on HDFS should be writable Spark app error

I have created a Spark application that uses the Hive metastore, but at the line that creates the external Hive table I get the following error when I execute the application (Spark driver logs):
Exception in thread "main" org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxrwxr-x;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:214)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwxrwxr-x
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
I run the application using the Spark operator for K8s.
So I checked the permissions of the directories on the driver pod of the Spark application:
ls -l /tmp
...
drwxrwxr-x 1 1001 1001 4096 Feb 22 16:47 hive
If I try to change the permissions, it has no effect.
I run Hive metastore and HDFS in K8s as well.
How can this problem be fixed?
This is a common error that can be fixed by creating a directory somewhere else and pointing Spark at the new directory.
Step 1: Create a new dir called tmpops at /tmp/tmpops
Step 2: Give the directory open permissions: chmod 777 /tmp/tmpops
Note: 777 is for local testing only. If you are working with sensitive data, make sure to add this path to your security groups to avoid accidental data leakage and security loopholes.
Step 3: Add the property below to the hive-site.xml that the Spark app refers to:
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/tmpops</value>
</property>
Once you do this, the error will no longer appear unless someone deletes that dir.
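If editing hive-site.xml is awkward (for example when running under the Spark operator on K8s), the same Hive property can also be passed through Spark's spark.hadoop.* configuration prefix, which forwards it into the Hadoop/Hive configuration the session uses; a minimal sketch, assuming the /tmp/tmpops path from the steps above:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-scratchdir-example")
  // Forwarded to the Hive client configuration as hive.exec.scratchdir
  .config("spark.hadoop.hive.exec.scratchdir", "/tmp/tmpops")
  .enableHiveSupport()
  .getOrCreate()
The same key can be supplied on spark-submit as --conf spark.hadoop.hive.exec.scratchdir=/tmp/tmpops.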
I faced the same issue on Windows 10; the solution below helped me fix it.
The following steps solved my problem:
Open Command Prompt in admin mode.
Run winutils.exe chmod 777 /tmp/hive
Start spark-shell --master local[2]

Spark-submit create only 1 executor when pyspark interactive shell create 4 (both using yarn-client)

I'm using the Cloudera quickstart VM (CDH 5.10.1) with PySpark (1.6.0) and YARN (MR2 included) to aggregate numerical data per hour. I've got 1 CPU with 4 cores and 32 GB of RAM.
I've got a file named aggregate.py, but until today I had never submitted the job with spark-submit; I used the pyspark interactive shell and copy/pasted the code to test it.
When starting the pyspark interactive shell I used:
pyspark --master yarn-client
I followed the job in the web UI, accessible at quickstart.cloudera:8088/cluster, and could see that YARN created 3 executors and 1 driver with one core each (not a good configuration, but the main purpose is a proof of concept until we move to a real cluster).
When submitting the same code with spark-submit:
spark-submit --verbose \
  --master yarn \
  --deploy-mode client \
  --num-executors 2 \
  --driver-memory 3G \
  --executor-memory 6G \
  --executor-cores 2 \
  aggregate.py
I only have the driver, which also executes the tasks. Note that spark.dynamicAllocation.enabled is set to true in the environment tab, and spark.dynamicAllocation.minExecutors is set to 2.
I tried using spark-submit aggregate.py alone and still got only the driver acting as an executor. I can't manage to get more than 1 executor with spark-submit, yet it works in the interactive shell!
My YARN configuration is as follows:
yarn.nodemanager.resource.memory-mb = 17 GiB
yarn.nodemanager.resource.cpu-vcores = 4
yarn.scheduler.minimum-allocation-mb = 3 GiB
yarn.scheduler.maximum-allocation-mb = 16 GiB
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 2
If someone can explain what I'm doing wrong, it would be a great help!
You have to set the driver memory and executor memory in spark-defaults.conf.
It's located at
$SPARK_HOME/conf/spark-defaults.conf
and if there is a file like
spark-defaults.conf.template
then you have to rename the file as
spark-defaults.conf
and then set the number of executors, the executor memory, and the number of executor cores. You can take the example from the template file or check this link:
https://spark.apache.org/docs/latest/configuration.html
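Purely as an illustration, entries in spark-defaults.conf look like the following; the values are placeholders, not recommendations for this particular VM:
spark.executor.instances   2
spark.executor.cores       1
spark.executor.memory      2g
spark.driver.memory        2g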
Alternatively: when you used pyspark it used the default executor memory, but here with spark-submit you set executor-memory = 6G. I think you have to reduce the memory or remove this flag so it can use the default.
Just a guess: as you said earlier, "Yarn created 3 executors and 1 driver with one core each", so you have 4 cores in total.
Now, as per your spark-submit statement:
cores = num-executors (2) * executor-cores (2) + driver (1) = 5
But in total you have only 4 cores, so YARN is unable to give you the requested executors (after the driver, only 3 cores are left). Check if this is the issue.

./spark-shell doesn't start correctly (spark1.6.1-bin.hadoop2.6 version)

I installed this spark version: spark-1.6.1-bin-hadoop2.6.tgz.
Now when I start Spark with the ./spark-shell command I'm getting these issues (it shows a lot of error lines, so I only include the ones that seem important):
Cleanup action completed
16/03/27 00:19:35 ERROR Schema: Failed initialising database.
Failed to create database 'metastore_db', see the next exception for details.
org.datanucleus.exceptions.NucleusDataStoreException: Failed to create database 'metastore_db', see the next exception for details.
at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
Caused by: java.sql.SQLException: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
... 128 more
Caused by: ERROR XBM0H: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
Nested Throwables StackTrace:
java.sql.SQLException: Failed to create database 'metastore_db', see the next exception for details.
org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
... 128 more
Caused by: ERROR XBM0H: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
at org.apache.derby.iapi.error.StandardException.newException
Caused by: java.sql.SQLException: Directory /usr/local/spark-1.6.1-bin-hadoop2.6/bin/metastore_db cannot be created.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
at
... 128 more
<console>:16: error: not found: value sqlContext
import sqlContext.implicits._
^
<console>:16: error: not found: value sqlContext
import sqlContext.sql
^
scala>
I tried some configurations to fix this issue that I found in other questions about the "value sqlContext not found" issue, such as:
/etc/hosts file:
127.0.0.1 hadoophost localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.2.0.15 hadoophost
echo $HOSTNAME returns:
hadoophost
.bashrc file contains:
export SPARK_LOCAL_IP=127.0.0.1
But that doesn't work. Can you give me some help to understand why Spark is not starting correctly?
hive-default.xml.template
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<!-- WARNING!!! This file is auto generated for documentation purposes ONLY! -->
<!-- WARNING!!! Any changes you make to this file will be ignored by Hive. -->
<!-- WARNING!!! You must make your changes in hive-site.xml instead. -->
In the home folder I get the same issues:
[hadoopadmin@hadoop home]$ pwd
/home
[hadoopadmin@hadoop home]$
Folder permissions:
[hadoopdadmin@hadoop spark-1.6.1-bin-hadoop2.6]$ ls -la
total 1416
drwxr-xr-x. 12 hadoop hadoop 4096 .
drwxr-xr-x. 16 root root 4096 ..
drwxr-xr-x. 2 hadoop hadoop 4096 bin
-rw-r--r--. 1 hadoop hadoop 1343562 CHANGES.txt
drwxr-xr-x. 2 hadoop hadoop 4096 conf
drwxr-xr-x. 3 hadoop hadoop 4096 data
drwxr-xr-x. 3 hadoop hadoop 4096 ec2
drwxr-xr-x. 3 hadoop hadoop 4096 examples
drwxr-xr-x. 2 hadoop hadoop 4096 lib
-rw-r--r--. 1 hadoop hadoop 17352 LICENSE
drwxr-xr-x. 2 hadoop hadoop 4096 licenses
-rw-r--r--. 1 hadoop hadoop 23529 NOTICE
drwxr-xr-x. 6 hadoop hadoop 4096 python
drwxr-xr-x. 3 hadoop hadoop 4096 R
-rw-r--r--. 1 hadoop hadoop 3359 README.md
-rw-r--r--. 1 hadoop hadoop 120 RELEASE
drwxr-xr-x. 2 hadoop hadoop 4096 sbin
Apparently you don't have permission to write in that directory. I recommend running ./spark-shell from your HOME directory (you might want to add that command to your PATH), or from any other directory accessible and writable by your user.
This might also be relevant for you: Notebooks together with Spark.
You are using Spark built with Hive support.
There are two possible solutions, depending on what you want to do later with your spark-shell or in your Spark jobs:
You want to access Hive tables in your Hadoop+Hive installation.
Place hive-site.xml in your Spark installation's conf sub-directory. Find hive-site.xml in your existing Hive installation; for example, in my Cloudera VM it is at /usr/lib/hive/conf. After doing this, launching spark-shell should connect to the existing Hive metastore and will not try to create a temporary metastore_db in your current working directory.
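For example, assuming the Cloudera VM location mentioned above (adjust the source path for your installation):
cp /usr/lib/hive/conf/hive-site.xml $SPARK_HOME/conf/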
You do NOT want to access Hive tables in your Hadoop+Hive installation.
If you do not care about connecting to Hive tables, you can follow Alberto's solution: fix the permission issues in the directory from which you are launching spark-shell, and make sure you are allowed to create directories and files in that directory.
Hope this helps.

Spark Invalid Checkpoint Directory

I have a long-running iteration in my program, and I want to cache and checkpoint every few iterations (this technique is suggested on the web to cut long lineage) so I won't get a StackOverflowError. I do this:
for (i <- 2 to 100) {
  // cache and checkpoint every 30 iterations
  if (i % 30 == 0) {
    graph.cache
    graph.checkpoint
    // I use numEdges in order to trigger the transformation I need
    graph.numEdges
  }
  // graphs are stored in a list;
  // here I use the graph of the previous iteration
  // and perform a transformation
}
and I have set the checkpoint directory like this:
val sc = new SparkContext(conf)
sc.setCheckpointDir("checkpoints/")
However, when I finally run my program I get an exception:
Exception in thread "main" org.apache.spark.SparkException: Invalid checkpoint directory
I use 3 computers, each running Ubuntu 14.04, and on each of them I use a pre-built version of Spark 1.4.1 for Hadoop 2.4 or later.
If you have already set up HDFS on a cluster of nodes, you can find your HDFS address in core-site.xml, located in the directory HADOOP_HOME/etc/hadoop. For me, core-site.xml is set up as:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
Then you can create a directory on HDFS to save RDD checkpoint files; let's name this directory RddCheckPoint and create it with the hadoop fs shell:
$ hadoop fs -mkdir /RddCheckPoint
If you use pyspark, after the SparkContext is initialized with sc = SparkContext(conf=conf), you can set the checkpoint directory with:
sc.setCheckpointDir("hdfs://master:9000/RddCheckPoint")
When an RDD is checkpointed, you can see the checkpoint files saved in the HDFS directory RddCheckPoint; to have a look:
$ hadoop fs -ls /RddCheckPoint
The checkpoint directory needs to be an HDFS-compatible directory (from the Scala doc: "HDFS-compatible directory where the checkpoint data will be reliably stored. Note that this must be a fault-tolerant file system like HDFS"). So if you have HDFS set up on those nodes, point it to "hdfs://[yourcheckpointdirectory]".
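In Scala, matching the code in the question, the fix is just to point setCheckpointDir at an HDFS path; a minimal sketch, where hdfs://master:9000 mirrors the core-site.xml example above and the directory name is a placeholder:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("checkpoint-example")
val sc = new SparkContext(conf)
// Must be a fault-tolerant, HDFS-compatible location visible to all nodes
sc.setCheckpointDir("hdfs://master:9000/RddCheckPoint")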
