spark-submit is not exiting until I hit ctrl+C - apache-spark

I am running this spark command to run spark Scala program successfully using Hortonworks vm. But once the job is completed it is not exiting from spark-submit command until I hit ctrl+C. Why?
spark-submit --class SimpleApp --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory12m --executor-cores 1 target/scala-2.10/application_2.10-1.0.jar /user/root/decks/largedeck.txt
Here is the code, I am running.
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
def main(args: Array[String]) {
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val cards = sc.textFile(args(0)).flatMap(_.split(" "))
val cardCount = cards.count()
println(cardCount)
}
}

You have to call stop() on context to exit your program cleanly.

I had the same kind of problem when writing files to S3. I use the spark 2.0 version, even after adding stop() if it didn't work for you. Try the below settings
In Spark 2.0 you can use,
val spark = SparkSession.builder().master("local[*]").appName("App_name").getOrCreate()
spark.conf.set("spark.hadoop.mapred.output.committer.class","com.appsflyer.spark.DirectOutputCommitter")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

Related

Unable to use a local file using spark-submit

I am trying to execute a spark word count program. My input file and output dir are on local and not on HDFS. When I execute the code, I get input directory not found exception.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
object WordCount {
val sparkConf = new SparkConf()
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().config(sparkConf).master("yarn").getOrCreate()
val input = args(0)
val output = args(1)
val text = spark.sparkContext.textFile("input",1)
val outPath = text.flatMap(line => line.split(" "))
val words = outPath.map(w => (w,1))
val wc = words.reduceByKey((x,y)=>(x+y))
wc.saveAsTextFile("output")
}
}
Spark Submit:
spark-submit --class com.practice.WordCount sparkwordcount_2.11-0.1.jar --files home/hmusr/ReconTest/inputdir/sample /home/hmusr/ReconTest/inputdir/wordout
I am using the option --files to fetch the local input file and point the output to output dir in spark-submit. When I submit the jar using spark-submit, it says input path does not exist:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://dev/user/hmusr/input
Could anyone let me know what is the mistake I am doing here ?
A couple of things:
val text = spark.sparkContext.textFile(input,1)
To use a variable, remove double quotes, is input not "input".
You expect input and output as an argument so in spark submit after jar (without --files) and use master as local.
Also, use file:// to use local files.
Your spark-submit should look something like:
spark-submit --master local[2] \
--class com.practice.WordCount \
sparkwordcount_2.11-0.1.jar \
file:///home/hmusr/ReconTest/inputdir/sample \
file:///home/hmusr/ReconTest/inputdir/wordout

Save CSV file to hbase table using Spark and Phoenix

Can someone point me to a working example of saving a csv file to Hbase table using Spark 2.2
Options that I tried and failed (Note: all of them work with Spark 1.6 for me)
phoenix-spark
hbase-spark
it.nerdammer.bigdata : spark-hbase-connector_2.10
All of them finally after fixing everything give similar error to this Spark HBase
Thanks
Add below parameters to your spark job-
spark-submit \
--conf "spark.yarn.stagingDir=/somelocation" \
--conf "spark.hadoop.mapreduce.output.fileoutputformat.outputdir=/s‌​omelocation" \
--conf "spark.hadoop.mapred.output.dir=/somelocation"
Phoexin has plugin and jdbc thin client which can connect(read/write) to HBASE, example are in https://phoenix.apache.org/phoenix_spark.html
Option 1 : Connect via zookeeper url - phoenix plugin
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.phoenix.spark._
val sc = new SparkContext("local", "phoenix-test")
val sqlContext = new SQLContext(sc)
val df = sqlContext.load(
"org.apache.phoenix.spark",
Map("table" -> "TABLE1", "zkUrl" -> "phoenix-server:2181")
)
df
.filter(df("COL1") === "test_row_1" && df("ID") === 1L)
.select(df("ID"))
.show
Option 2 : Use JDBC thin client provied by phoenix query server
more info on https://phoenix.apache.org/server.html
jdbc:phoenix:thin:url=http://localhost:8765;serialization=PROTOBUF

Unable to access HBase Via API

I have hbase installed over three nodes. I am trying to load hbase via spark with the help of below code.
from __future__ import print_function
import sys
from pyspark import SparkContext
import json
if __name__ == "__main__":
print ("*******************************")
sc = SparkContext(appName="HBaseOutputFormat")
host = sys.argv[1]
table = "hbase_test"
port = "2181"
conf = {"hbase.zookeeper.quorum": host,
"hbase.mapred.outputtable": table,
"hbase.zookeeper.property.clientPort":port,
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
rdd = sc.parallelize([sys.argv[2:]]).map(lambda x: (x[0], x))
print (rdd.collect())
rdd.saveAsNewAPIHadoopDataset(
conf=conf,
keyConverter=keyConv,
valueConverter=valueConv)
sc.stop()
I am executing code as:
spark-submit --driver-class-path /usr/iop/4.3.0.0-0000/hbase/lib/hbase-server.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-common.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-client.jar:/usr/iop/4.3.0.0-0000/hbase/lib/zookeeper.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-protocol.jar:/usr/iop/4.3.0.0-0000/spark2/examples/jars/scopt_2.11-3.3.0.jar:/home/tanveer/spark-examples_2.10-1.1.0.jar --conf spark.ui.port=5054 --master local[2] /data/usr/tanveer/from_home/spark/hbase_outputformat.py HBASE_MASTER_ip row1 f1 q1 value1
But the job stucks and doesn't proceed. Below is the snapshot:
As per some previous threads I tried changing /etc/hosts to comment localhost line but it didn't worked.
Requesting your help.
On further debugging I referred to below blog post from Hortononworks link for best practice:
https://community.hortonworks.com/articles/4091/hbase-client-application-best-practices.html
I have added hbase configuration file to driver class path and ran the code and it worked perfectly fine.
Modified spark-submit can be viewed as:
spark-submit --driver-class-path /usr/iop/4.3.0.0-0000/hbase/lib/hbase-server.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-common.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-client.jar:/usr/iop/4.3.0.0-0000/hbase/lib/zookeeper.jar:/usr/iop/4.3.0.0-0000/hbase/lib/hbase-protocol.jar:/usr/iop/4.3.0.0-0000/spark2/examples/jars/scopt_2.11-3.3.0.jar:/home/tanveer/spark-examples_2.10-1.1.0.jar:**/etc/hbase/conf** --conf spark.ui.port=5054 --master local[2] /data/usr/tanveer/from_home/spark/hbase_outputformat.py host row1 f1 q1 value1

Getting parameters of Spark submit while running a Spark job

I am running a spark job by spark-submit and using its --files parameter to load a log4j.properties file.
In my Spark job I need to get this parameter
object LoggerSparkUsage {
def main(args: Array[String]): Unit = {
//DriverHolder.log.info("unspark")
println("args are....."+args.mkString(" "))
val conf = new SparkConf().setAppName("Simple_Application")//.setMaster("local[4]")
val sc = new SparkContext(conf)
// conf.getExecutorEnv.
val count = sc.parallelize(Array(1, 2, 3)).count()
println("these are files"+conf.get("files"))
LoggerDriver.log.info("log1 for info..")
LoggerDriver.log.info("log2 for infor..")
f2
}
def f2{LoggerDriver.log.info("logs from another function..")}
}
my spark submit is something like this:
/opt/mapr/spark/spark-1.6.1/bin/spark-submit --class "LoggerSparkUsage" --master yarn-client --files src/main/resources/log4j.properties /mapr/cellos-mapr/user/mbazarganigilani/SprkHbase/target/scala-2.10/sprkhbase_2.10-1.0.2.jar
I tried to get the properties using
conf.get("files")
but it gives me an exception
can anyone give me a solution for this?
A correct key for files is spark.files:
scala.util.Try(sc.getConf.get("spark.files"))
but to get actual path on the workers you have to use SparkFiles:
org.apache.spark.SparkFiles.get(fileName)
If it is not sufficient you can pass these second as application arguments and retrieve from main args or use custom key in spark.conf.

Error when creating a StreamingContext

I open the spark shell
spark-shell --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0
Then I want to create a streaming context
import org.apache.spark._
import org.apache.spark.streaming._
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount").set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(1))
I run into a exception:
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
When you open the spark-shell, there is already a streaming context created. It is called sc, meaning you do not need to create a configure object. Simply use the existing sc object.
val ssc = new StreamingContext(sc,Seconds(1))
instead of var we will use val

Resources