Unable to use a local file using spark-submit - apache-spark

I am trying to execute a Spark word count program. My input file and output directory are local, not on HDFS. When I execute the code, I get an "input path does not exist" exception.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object WordCount {
  val sparkConf = new SparkConf()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(sparkConf).master("yarn").getOrCreate()
    val input = args(0)
    val output = args(1)
    val text = spark.sparkContext.textFile("input", 1)
    val outPath = text.flatMap(line => line.split(" "))
    val words = outPath.map(w => (w, 1))
    val wc = words.reduceByKey((x, y) => x + y)
    wc.saveAsTextFile("output")
  }
}
Spark Submit:
spark-submit --class com.practice.WordCount sparkwordcount_2.11-0.1.jar --files home/hmusr/ReconTest/inputdir/sample /home/hmusr/ReconTest/inputdir/wordout
I am using the --files option to pass the local input file and pointing the output to an output directory in spark-submit. When I submit the jar, it says the input path does not exist:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://dev/user/hmusr/input
Could anyone let me know what mistake I am making here?

A couple of things:
val text = spark.sparkContext.textFile(input,1)
To use the variable, remove the double quotes: it should be input, not "input".
You expect the input and output paths as arguments, so pass them after the jar in spark-submit (without --files), and use local as the master.
Also, prefix the paths with file:// to read local files.
Your spark-submit should look something like:
spark-submit --master local[2] \
--class com.practice.WordCount \
sparkwordcount_2.11-0.1.jar \
file:///home/hmusr/ReconTest/inputdir/sample \
file:///home/hmusr/ReconTest/inputdir/wordout
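With those changes, the whole program would look roughly like this (a sketch; the package name com.practice and the local paths come from your spark-submit above):

package com.practice

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[2]")
      .getOrCreate()

    val input = args(0)   // e.g. file:///home/hmusr/ReconTest/inputdir/sample
    val output = args(1)  // e.g. file:///home/hmusr/ReconTest/inputdir/wordout

    spark.sparkContext.textFile(input)
      .flatMap(_.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(output)

    spark.stop()
  }
}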

Related

Load file using SparkContext.addFile and load the file using load or csv method

I am trying to load a test file using Spark and Java. The code works fine in client mode (on my local machine) but throws a FileNotFoundException in cluster mode (i.e. on the server).
SparkSession spark = SparkSession
.builder()
.config("spark.mesos.coarse","true")
.config("spark.scheduler.mode","FAIR")
.appName("1")
.master("local")
.getOrCreate();
spark.sparkContext().addFile("https://mywebsiteurl/TestFile.csv");
String[] fileServerUrlArray = fileServerUrl.split("/");
fileName = fileServerUrlArray[fileServerUrlArray.length - 1];
String file = SparkFiles.get(fileName);
String modifiedFile="file://"+file;
spark.read()
.option("header", "true")
.load(modifiedFile); //getting FileNotFoundException in this line
I am getting a FileNotFoundException at the load call.
When your job runs in cluster mode, Spark will never write to the local filesystem of the driver. The best option is to collect() or use toLocalIterator() if you can hold the data in a buffer. Please try the code below and share whether it works for you:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

val conf = new Configuration()
val hdfspath = new Path("hdfs:///user/home/testFile.dat")
val localpath = new Path("file:///user/home/test/")
val fs = hdfspath.getFileSystem(conf) // FileSystem for the HDFS path
fs.copyToLocalFile(hdfspath, localpath)
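If you only need a (small) result back on the driver rather than a file on its local disk, the collect()/toLocalIterator() route mentioned above is a one-liner; a sketch, where resultRdd is a placeholder for whatever RDD your job produces:

// Stream rows back to the driver instead of writing to its local filesystem.
// `resultRdd` is hypothetical - substitute the RDD your job produces.
resultRdd.toLocalIterator.foreach(println)  // or resultRdd.collect() for small data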

How to read a part-00000.deflate file on Linux

I have written a Spark word count program using the code below:
package com.practice

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object WordCount {
  val sparkConf = new SparkConf()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(sparkConf).master("local[2]").getOrCreate()
    val input = args(0)
    val output = args(1)
    val text = spark.sparkContext.textFile(input)
    val outPath = text.flatMap(line => line.split(" "))
    val words = outPath.map(w => (w, 1))
    val wc = words.reduceByKey((x, y) => x + y)
    wc.saveAsTextFile(output)
  }
}
Using spark-submit, I run the jar and get the output in the output directory:
SPARK_MAJOR_VERSION=2 spark-submit --master local[2] --class com.practice.WordCount sparkwordcount_2.11-0.1.jar file:///home/hmusr/ReconTest/inputdir/sample file:///home/hmusr/ReconTest/inputdir/output
Both the input and output files are local, not on HDFS.
In the output directory, I see two files: part-00000.deflate and _SUCCESS.
The output file has a .deflate extension. After checking online I understood that the output was saved as a compressed file, but is there any way I can read it?
Try this one.
cat part-00000.deflate | perl -MCompress::Zlib -e 'undef $/; print uncompress(<>)'
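Alternatively, Spark/Hadoop handle the .deflate codec transparently, so you can read the compressed output back with Spark itself, or switch compression off before saving. A sketch (the path matches the output directory above; the compression keys are the standard Hadoop ones, with old and new names set for safety across versions):

// Read the compressed part files back; the .deflate codec is detected by extension.
val wordsBack = spark.sparkContext.textFile("file:///home/hmusr/ReconTest/inputdir/output")
wordsBack.take(10).foreach(println)

// Or disable output compression before calling saveAsTextFile.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("mapred.output.compress", "false")
hc.set("mapreduce.output.fileoutputformat.compress", "false")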

Input string error after a spark-submit

I'm trying to run some Spark Scala code:
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object EzRecoMRjobs {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("Product Cardinalities")
    val sc = new SparkContext(conf)

    val dataset = sc.textFile(args(0))

    // Load parameters
    val customerIndex = args(1).toInt - 1
    val ProductIndex = args(2).toInt - 1
    val outputPath = args(3).toString

    val resu = dataset.map { line =>
        val orderId = line.split("\t")(0)
        val cols = line.split("\t")(1).split(";")
        cols(ProductIndex)
      }
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(outputPath)

    sc.stop()
  }
}
This code works in IntelliJ and writes the result to the "outputPath" folder.
From my IntelliJ project I have generated a .jar file, and I want to run this code with spark-submit. So in my terminal I launch:
spark-submit \
--jars /Users/users/Documents/TestScala/ezRecoPreBuild/target/ezRecoPreBuild-1.0-SNAPSHOT.jar \
--class com.np6.scala.EzRecoMRjobs \
--master local \
/Users/users/Documents/DATA/data.txt 1 2 /Users/users/Documents/DATA/dossier
But I got this error:
Exception in thread "main" java.lang.NumberFormatException: For input string: "/Users/users/Documents/DATA/dossier"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:569)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at com.np6.scala.EzRecoMRjobs$.main(ezRecoMRjobs.scala:51)
at com.np6.scala.EzRecoMRjobs.main(ezRecoMRjobs.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
What's the reason for this error? Thanks.
Check out the documentation: https://spark.apache.org/docs/latest/submitting-applications.html
spark-submit expects the application jar as the first positional argument, so your arguments are shifted and your code ends up parsing the last argument (a String path) as a number, hence the NumberFormatException.
The --jars flag is used to specify additional jars that will be used by your application.
You should run the spark-submit command this way:
spark-submit \
--class com.np6.scala.EzRecoMRjobs \
--master local[*] \
/Users/users/Documents/TestScala/ezRecoPreBuild/target/ezRecoPreBuild-1.0-SNAPSHOT.jar /Users/users/Documents/DATA/data.txt 1 2 /Users/users/Documents/DATA/dossier
Hope it helps.
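If you want a clearer failure the next time the arguments are off, a small guard at the top of main is an option (a sketch, not part of the original code):

// Fail fast with a usage message instead of a NumberFormatException.
if (args.length != 4) {
  System.err.println("Usage: EzRecoMRjobs <inputPath> <customerIndex> <productIndex> <outputPath>")
  sys.exit(1)
}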

Getting parameters of Spark submit while running a Spark job

I am running a Spark job via spark-submit and using its --files parameter to ship a log4j.properties file.
In my Spark job I need to read this parameter:
object LoggerSparkUsage {
  def main(args: Array[String]): Unit = {
    // DriverHolder.log.info("unspark")
    println("args are....." + args.mkString(" "))
    val conf = new SparkConf().setAppName("Simple_Application") //.setMaster("local[4]")
    val sc = new SparkContext(conf)
    // conf.getExecutorEnv.
    val count = sc.parallelize(Array(1, 2, 3)).count()
    println("these are files" + conf.get("files"))
    LoggerDriver.log.info("log1 for info..")
    LoggerDriver.log.info("log2 for infor..")
    f2
  }

  def f2 { LoggerDriver.log.info("logs from another function..") }
}
My spark-submit looks something like this:
/opt/mapr/spark/spark-1.6.1/bin/spark-submit --class "LoggerSparkUsage" --master yarn-client --files src/main/resources/log4j.properties /mapr/cellos-mapr/user/mbazarganigilani/SprkHbase/target/scala-2.10/sprkhbase_2.10-1.0.2.jar
I tried to get the property using
conf.get("files")
but it gives me an exception.
Can anyone give me a solution for this?
The correct key for the files is spark.files:
scala.util.Try(sc.getConf.get("spark.files"))
but to get the actual path on the workers you have to use SparkFiles:
org.apache.spark.SparkFiles.get(fileName)
If that is not sufficient, you can pass the file names as application arguments and retrieve them from main's args, or use a custom key in the Spark conf.
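Putting the two together inside the job, a minimal sketch (assuming the file shipped with --files is the log4j.properties from your command):

import org.apache.spark.SparkFiles
import scala.util.Try

// The comma-separated list passed with --files is exposed under "spark.files".
val shipped = Try(sc.getConf.get("spark.files")).toOption
println("files shipped with the job: " + shipped.getOrElse("<none>"))

// Resolve the local copy of a shipped file by name (works on the driver and executors).
val localLog4j = SparkFiles.get("log4j.properties")
println("local copy of log4j.properties: " + localLog4j)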

spark-submit is not exiting until I hit ctrl+C

I am running this spark-submit command to run a Spark Scala program successfully on the Hortonworks VM, but once the job is completed it does not exit from spark-submit until I hit Ctrl+C. Why?
spark-submit --class SimpleApp --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 12m --executor-cores 1 target/scala-2.10/application_2.10-1.0.jar /user/root/decks/largedeck.txt
Here is the code I am running:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val cards = sc.textFile(args(0)).flatMap(_.split(" "))
    val cardCount = cards.count()
    println(cardCount)
  }
}
You have to call stop() on the context to exit your program cleanly.
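For the SimpleApp above, that just means ending main with sc.stop(); a sketch of the adjusted program:

/* SimpleApp.scala */
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val cards = sc.textFile(args(0)).flatMap(_.split(" "))
    println(cards.count())
    sc.stop() // shut down the SparkContext so spark-submit can exit
  }
}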
I had the same kind of problem when writing files to S3. I use Spark 2.0; if it still does not exit even after adding stop(), try the settings below.
In Spark 2.0 you can use:
val spark = SparkSession.builder().master("local[*]").appName("App_name").getOrCreate()
spark.conf.set("spark.hadoop.mapred.output.committer.class","com.appsflyer.spark.DirectOutputCommitter")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
