Input string error after a spark-submit

I'm trying to run some Spark Scala code:
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable.ListBuffer

object EzRecoMRjobs {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("Product Cardinalities")
    val sc = new SparkContext(conf)

    val dataset = sc.textFile(args(0))

    // Load parameters
    val customerIndex = args(1).toInt - 1
    val ProductIndex = args(2).toInt - 1
    val outputPath = args(3).toString

    val resu = dataset.map( line => {
        val orderId = line.split("\t")(0)
        val cols = line.split("\t")(1).split(";")
        cols(ProductIndex)
      })
      .map( x => (x, 1) )
      .reduceByKey(_ + _)
      .saveAsTextFile(outputPath)

    sc.stop()
  }
}
This code works in IntelliJ and writes the result to the "outputPath" folder.
From my IntelliJ project I generated a .jar file and I want to run this code with spark-submit. So in my terminal I launch:
spark-submit \
--jars /Users/users/Documents/TestScala/ezRecoPreBuild/target/ezRecoPreBuild-1.0-SNAPSHOT.jar \
--class com.np6.scala.EzRecoMRjobs \
--master local \
/Users/users/Documents/DATA/data.txt 1 2 /Users/users/Documents/DATA/dossier
But I get this error:
Exception in thread "main" java.lang.NumberFormatException: For input string: "/Users/users/Documents/DATA/dossier"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:569)
at java.lang.Integer.parseInt(Integer.java:615)
at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
at com.np6.scala.EzRecoMRjobs$.main(ezRecoMRjobs.scala:51)
at com.np6.scala.EzRecoMRjobs.main(ezRecoMRjobs.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
What's the reason for this error? Thanks.

Check out the documentation: https://spark.apache.org/docs/latest/submitting-applications.html
spark-submit expects its first positional argument to be the path of your application jar; everything after the jar is passed to your main method as application arguments. Because you passed your jar through --jars, the data file path was taken as the application jar, the remaining arguments shifted by one, and your code ended up parsing the output path (a plain String) as a number, hence the NumberFormatException.
The --jars flag is only used to specify additional jars that your application depends on.
You must run the spark-submit command like this:
spark-submit \
--class com.np6.scala.EzRecoMRjobs \
--master local[*] \
/Users/users/Documents/TestScala/ezRecoPreBuild/target/ezRecoPreBuild-1.0-SNAPSHOT.jar /Users/users/Documents/DATA/data.txt 1 2 /Users/users/Documents/DATA/dossier
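As an optional sanity check (my own addition, not part of the original answer), the main method could validate the argument list before parsing, so a mistake like this fails with a clearer message:

// With the corrected command, the application arguments are:
//   args(0) = /Users/users/Documents/DATA/data.txt   (input path)
//   args(1) = 1                                       (customer column)
//   args(2) = 2                                       (product column)
//   args(3) = /Users/users/Documents/DATA/dossier     (output path)
require(args.length == 4,
  s"Usage: <inputPath> <customerCol> <productCol> <outputPath>, got: ${args.mkString(" ")}")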
Hope it helps.

Related

Unable to use a local file using spark-submit

I am trying to execute a Spark word count program. My input file and output directory are local, not on HDFS. When I execute the code, I get an "input directory not found" exception.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object WordCount {
  val sparkConf = new SparkConf()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().config(sparkConf).master("yarn").getOrCreate()
    val input = args(0)
    val output = args(1)
    val text = spark.sparkContext.textFile("input", 1)
    val outPath = text.flatMap(line => line.split(" "))
    val words = outPath.map(w => (w, 1))
    val wc = words.reduceByKey((x, y) => (x + y))
    wc.saveAsTextFile("output")
  }
}
Spark Submit:
spark-submit --class com.practice.WordCount sparkwordcount_2.11-0.1.jar --files home/hmusr/ReconTest/inputdir/sample /home/hmusr/ReconTest/inputdir/wordout
I am using the --files option to fetch the local input file, and I point the output to the output directory in spark-submit. When I submit the jar using spark-submit, it says the input path does not exist:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://dev/user/hmusr/input
Could anyone let me know what mistake I am making here?
A couple of things:
val text = spark.sparkContext.textFile(input, 1)
To use the variable, remove the double quotes: it should be input, not "input" (and likewise output, not "output").
You expect input and output as arguments, so pass them on the spark-submit command line after the jar (without --files), and use local as the master.
Also, prefix local paths with file:// to read local files.
Your spark-submit should look something like:
spark-submit --master local[2] \
--class com.practice.WordCount \
sparkwordcount_2.11-0.1.jar \
file:///home/hmusr/ReconTest/inputdir/sample \
file:///home/hmusr/ReconTest/inputdir/wordout
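Putting those changes together, the program would look roughly like this (a sketch; it keeps the question's structure but actually uses the argument variables and a local master):

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Input and output paths come from the command line,
    // e.g. file:///home/hmusr/ReconTest/inputdir/sample
    val spark = SparkSession.builder().master("local[2]").appName("WordCount").getOrCreate()
    val input = args(0)
    val output = args(1)
    val text = spark.sparkContext.textFile(input, 1)
    val wc = text.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
    wc.saveAsTextFile(output)
    spark.stop()
  }
}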

Streaming avro files from a directory

I'm trying to set up a structured stream from a directory of Avro files. We already have some non-streaming code to deal with exactly the same data, so the least-effort step forward to streaming would be to re-use that code.
To move to Structured Streaming, I tried the following, which works in the non-streaming manner (using read instead of readStream) but gives me a serialization error in the streaming approach.
import com.databricks.spark.avro._
import org.apache.avro._
import org.apache.spark.sql.types._

val schemaStr = """ {our_schema_here} """

val parser = new Schema.Parser()
val avroSchema = parser.parse(schemaStr)

val structType = SchemaConverters.toSqlType(avroSchema).dataType match {
  case t: StructType => Some(t)
  case _ => throw new RuntimeException(
    s"""Avro schema cannot be converted to a Spark SQL StructType:
       |
       |${avroSchema.toString(true)}
       |""".stripMargin)
}

val path = "dbfs://path/to/avro/files/*"

val avroStream = sqlContext
  .readStream
  .schema(structType.get)
  .format("com.databricks.spark.avro")
  .option("maxFilesPerTrigger", 5)
  .load(path)
  .writeStream
  .outputMode("append")
  .format("memory")
  .queryName("counts")
  .start()
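For reference, the working non-streaming variant referred to above looks roughly like this, reusing the same structType and path (a sketch, not the exact original code):

val avroDf = sqlContext
  .read
  .schema(structType.get)
  .format("com.databricks.spark.avro")
  .load(path)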
The exception I get is shown below. Note that I can't get the full stack trace, as I'm on Databricks and can't access the executor logs. I'm a bit at a loss as to what exactly the object is that can't be serialized.
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2125)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:937)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:936)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:299)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:291)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2966)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2456)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2456)
at org.apache.spark.sql.Dataset$$anonfun$57.apply(Dataset.scala:2950)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:99)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2949)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2456)
at org.apache.spark.sql.execution.streaming.MemorySink.addBatch(memory.scala:217)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(StreamExecution.scala:731)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1$$anonfun$apply$mcV$sp$1.apply(StreamExecution.scala:731)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1$$anonfun$apply$mcV$sp$1.apply(StreamExecution.scala:731)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:99)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply$mcV$sp(StreamExecution.scala:730)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:730)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch$1.apply(StreamExecution.scala:730)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:62)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:729)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:328)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:316)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:316)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:62)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:316)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:312)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:226)
Caused by: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1
Serialization stack:
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 41 more

Getting parameters of Spark submit while running a Spark job

I am running a Spark job with spark-submit and using its --files parameter to ship a log4j.properties file.
In my Spark job I need to read this parameter:
object LoggerSparkUsage {
  def main(args: Array[String]): Unit = {
    //DriverHolder.log.info("unspark")
    println("args are....." + args.mkString(" "))
    val conf = new SparkConf().setAppName("Simple_Application") //.setMaster("local[4]")
    val sc = new SparkContext(conf)
    // conf.getExecutorEnv.
    val count = sc.parallelize(Array(1, 2, 3)).count()
    println("these are files" + conf.get("files"))
    LoggerDriver.log.info("log1 for info..")
    LoggerDriver.log.info("log2 for infor..")
    f2
  }

  def f2 { LoggerDriver.log.info("logs from another function..") }
}
My spark-submit is something like this:
/opt/mapr/spark/spark-1.6.1/bin/spark-submit --class "LoggerSparkUsage" --master yarn-client --files src/main/resources/log4j.properties /mapr/cellos-mapr/user/mbazarganigilani/SprkHbase/target/scala-2.10/sprkhbase_2.10-1.0.2.jar
I tried to get the property using
conf.get("files")
but it gives me an exception.
Can anyone give me a solution for this?
The correct key for the files is spark.files:
scala.util.Try(sc.getConf.get("spark.files"))
but to get the actual path on the workers you have to use SparkFiles:
org.apache.spark.SparkFiles.get(fileName)
If that is not sufficient, you can pass the path as an application argument and retrieve it from main's args, or use a custom key in the Spark configuration.
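Putting the two together, a minimal sketch (the object name FilesExample is mine; it assumes the job was submitted with --files .../log4j.properties as in the question):

import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object FilesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Files_Example"))
    // Comma-separated list of files distributed with --files, if any
    val distributed = Try(sc.getConf.get("spark.files")).getOrElse("")
    println("spark.files = " + distributed)
    // Local path where the distributed file can be found
    println("log4j.properties is at " + SparkFiles.get("log4j.properties"))
    sc.stop()
  }
}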

teradata jdbc jar not loading in spark

I'm trying to load the Teradata JDBC jar file in Spark but can't get it to load. I start the spark shell like this:
spark-shell --jars ~/*.jar --driver-class-path ~/*.jar
In my home directory I have a jar file called terajdbc4.jar.
When the spark shell starts, I do this:
scala> sc.addJar("terajdbc4.jar")
15/12/07 12:27:55 INFO SparkContext: Added JAR terajdbc4.jar at http://1.2.4.4:41601/jars/terajdbc4.jar with timestamp 1449509275187
scala> sc.jars
res1: Seq[String] = List(file:/home/user1/spark-cassandra-connector_2.10-1.0.0-beta1.jar)
scala>
but it's not there in the jars. Why is it still missing?
EDIT:
OK, I got the jar to load, but now I'm getting this error:
java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver
I do the following:
scala> sc.jars
res4: Seq[String] = List(file:/home/user/terajdbc4.jar)
scala> import com.teradata.jdbc.TeraDriver
import com.teradata.jdbc.TeraDriver
scala> Class.forName("com.teradata.jdbc.TeraDriver")
res5: Class[_] = class com.teradata.jdbc.TeraDriver
and then this:
val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:teradata://dbinstn, TMODE=TERA, user=user1, password=pass1",
  "dbtable" -> "db1a.table1a",
  "driver" -> "com.teradata.jdbc.TeraDriver"))
and then I get this:
java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver
spark-shell --jars ~/*.jar --driver-class-path ~/*.jar
Please refer to Using wildcards in java classpath.
Wildcards like *.jar are not supported there, so add the specific jar file paths instead.
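For example, with the path shown by sc.jars above (an illustrative command, assuming the jar really lives at /home/user/terajdbc4.jar):
spark-shell --jars /home/user/terajdbc4.jar --driver-class-path /home/user/terajdbc4.jar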

spark-submit is not exiting until I hit ctrl+C

I am running the spark-submit command below to run a Spark Scala program successfully on a Hortonworks VM. But once the job is completed, spark-submit does not exit until I hit Ctrl+C. Why?
spark-submit --class SimpleApp --master yarn-client --num-executors 3 --driver-memory 512m --executor-memory 12m --executor-cores 1 target/scala-2.10/application_2.10-1.0.jar /user/root/decks/largedeck.txt
Here is the code, I am running.
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val cards = sc.textFile(args(0)).flatMap(_.split(" "))
    val cardCount = cards.count()
    println(cardCount)
  }
}
You have to call stop() on the context to exit your program cleanly.
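A minimal sketch of the question's program with the context stopped at the end (the unused logFile value is dropped):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val cards = sc.textFile(args(0)).flatMap(_.split(" "))
    println(cards.count())
    sc.stop() // releases resources and lets spark-submit exit
  }
}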
I had the same kind of problem when writing files to S3. I am on Spark 2.0, and the job did not exit even after adding stop(). If that happens to you, try the settings below.
In Spark 2.0 you can use:
val spark = SparkSession.builder().master("local[*]").appName("App_name").getOrCreate()
spark.conf.set("spark.hadoop.mapred.output.committer.class","com.appsflyer.spark.DirectOutputCommitter")
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
