Error when creating a StreamingContext - apache-spark

I open the spark shell
spark-shell --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.0
Then I want to create a streaming context
import org.apache.spark._
import org.apache.spark.streaming._
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount").set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(1))
I run into an exception:
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:

When you open the spark-shell, a SparkContext has already been created for you. It is called sc, so you do not need to create a new configuration object. Simply pass the existing sc to the StreamingContext:
val ssc = new StreamingContext(sc, Seconds(1))
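For completeness, a minimal sketch of the usual spark-shell flow (reusing the shell's built-in sc) looks like this:
import org.apache.spark.streaming.{Seconds, StreamingContext}

// spark-shell already provides a SparkContext named sc; reuse it.
val ssc = new StreamingContext(sc, Seconds(1))

// If you really need a fresh SparkContext, stop the existing one first
// instead of setting spark.driver.allowMultipleContexts:
// sc.stop()
// val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
// val ssc = new StreamingContext(conf, Seconds(1))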

Related

How to setup spark.sql.legacy.allowUntypedScalaUDF flag in SparkConf running in EMR 6.3

Please help me with an issue where I am not able to set "spark.sql.legacy.allowUntypedScalaUDF" to true in SparkConf.
val sparkConf = new SparkConf()
sparkConf.setIfMissing("spark.master", "local[*]")
sparkConf.set("spark.executorEnv.spark.sql.legacy.allowUntypedScalaUDF",true)
val session = SparkSession.builder()
.config(sparkConf)
//.config("spark.sql.legacy.allowUntypedScalaUDF",true)
.getOrCreate()
It gives me the following error:
[Error] D:\PassAddress.scala.12: type mismatch;
found : Boolean(true)
required: String
one error found
And if I change it to a string, it still does not take effect:
val sparkConf = new SparkConf()
sparkConf.setIfMissing("spark.master", "local[*]")
sparkConf.set("spark.executorEnv.spark.sql.legacy.allowUntypedScalaUDF","true")
or if I set it as below, the result is still the same, and I am not able to proceed further.
val sparkConf = new SparkConf()
sparkConf.setIfMissing("spark.master", "local[*]")
sparkConf.set("spark.sql.legacy.allowUntypedScalaUDF","true")
My objective is to set this flag so that the existing functionality keeps working on Spark 3.0 onwards, since Spark 3.0 removed the default support for untyped Scala UDFs.
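One way to sanity-check the setting (a sketch, assuming Spark 3.x) is to set it as a string directly on the SQL key, rather than through a spark.executorEnv.* variant, and then read it back from the running session:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf()
sparkConf.setIfMissing("spark.master", "local[*]")
// SparkConf.set expects String values, so pass "true" as a string.
sparkConf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")

val session = SparkSession.builder()
  .config(sparkConf)
  .getOrCreate()

// The flag can also be changed on an existing session at runtime:
session.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")

// Verify that the setting was actually picked up:
println(session.conf.get("spark.sql.legacy.allowUntypedScalaUDF"))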

Difference between spark_session and sqlContext on loading a local file

I tried to load a local file as a dataframe using both spark_session and sqlContext.
df = spark_session.read...load(localpath)
It couldn't read the local file; df was empty.
But after creating a SQLContext from spark_context, it could load the local file.
sqlContext = SQLContext(spark_context)
df = sqlContext.read...load(localpath)
It worked fine, but I can't understand why. What is the cause?
Environment: Windows 10, Spark 2.2.1
EDIT
Finally I've resolved this problem. The root cause was a version difference between the PySpark installed with pip and the PySpark installed on the local file system; PySpark failed to start because py4j was failing.
I am pasting sample code that might help. We have used this to create a SparkSession object and read a local file with it:
import org.apache.spark.sql.SparkSession

object SetTopBox_KPI1_1 {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("SetTopBox Data Analysis <Input-File> OR <Output-File> is missing")
      System.exit(1)
    }
    val spark = SparkSession.builder().appName("KPI1_1").getOrCreate()
    val record = spark.read.textFile(args(0)).rdd
    .....
On the whole, in Spark 2.2 the preferred way to use Spark is by creating a SparkSession object.
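As a usage sketch (the path here is made up for illustration), reading a local file through a SparkSession with an explicit file:// URI makes it unambiguous which file system is being used:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LocalFileRead")
  .master("local[*]")
  .getOrCreate()

// The explicit file:// scheme makes clear the path is on the local file system.
val df = spark.read.text("file:///C:/data/sample.txt")
df.show(5)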

I'm trying to read the data from the files in the directory as soon as a new file is created. Real time "File Streaming"

I'm currently learning Spark Streaming. I'm trying to read the data from files in a directory as soon as a new file is created, i.e. real-time "file streaming". I'm getting the error below. Can anyone suggest a solution?
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object FileStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.textFileStream("C:\\Users\\PRAGI V\\Desktop\\data-master\\data-master\\cards")
    lines.flatMap(x => x.split(" ")).map(x => (x, 1)).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Error:
Exception in thread "main" org.apache.spark.SparkException: An application name must be set in your configuration
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:170)
  at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:555)
  at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:75)
  at FileStreaming$.main(FileStreaming.scala:15)
  at FileStreaming.main(FileStreaming.scala)
The error message is very clear: you need to set the app name in the SparkConf object.
Replace
val conf = new SparkConf().setMaster("local[2]")
with
val conf = new SparkConf().setMaster("local[2]").setAppName("MyApp")
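Applied to the program above, a minimal corrected version (same logic, same directory path as in the question) would be:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreaming {
  def main(args: Array[String]): Unit = {
    // SparkContext requires both a master and an application name.
    val conf = new SparkConf().setMaster("local[2]").setAppName("FileStreaming")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.textFileStream("C:\\Users\\PRAGI V\\Desktop\\data-master\\data-master\\cards")
    lines.flatMap(x => x.split(" ")).map(x => (x, 1)).print()
    ssc.start()
    ssc.awaitTermination()
  }
}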
I would suggest reading the official Spark Programming Guide:
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
The appName parameter is a name for your application to show on the cluster UI. master is a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode.
The online documentation has lots of examples to get you started.
Cheers!

error: value cassandraTable is not a member of org.apache.spark.SparkContext

I want to access a Cassandra table in Spark. Below are the versions that I am using:
spark: spark-1.4.1-bin-hadoop2.6
cassandra: apache-cassandra-2.2.3
spark cassandra connector: spark-cassandra-connector-java_2.10-1.5.0-M2.jar
Below is the script:
sc.stop
import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org.apache.spark.SparkConf
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)
val test_spark_rdd = sc.cassandraTable("test1", "words")
When I run the last statement I get an error:
:32: error: value cassandraTable is not a member of
org.apache.spark.SparkContext
val test_spark_rdd = sc.cassandraTable("test1", "words")
Hints to resolve the error would be helpful.
Thanks
Actually, on the shell you only need to import the respective packages; no extra setup is needed. The import brings in the implicits that add cassandraTable to SparkContext, e.g.:
scala> import com.datastax.spark.connector._
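A minimal spark-shell session along those lines (assuming the connector jar is already on the shell classpath, e.g. via --jars or --packages) might look like this:
// Stop the shell's default context so it can be rebuilt with the Cassandra host set.
sc.stop

import com.datastax.spark.connector._   // adds cassandraTable to SparkContext via implicits
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val sc = new SparkContext(conf)

val test_spark_rdd = sc.cassandraTable("test1", "words")
test_spark_rdd.take(5).foreach(println)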

How to stop a StreamingContext in Apache Spark on Zeppelin

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.eventhubs.EventHubsUtils
import sqlContext.implicits._
val ehParams = Map[String, String](
  "eventhubs.policyname" -> "Full",
  ...
)
val ssc = new StreamingContext(sc, Seconds(2))
val stream = EventHubsUtils.createUnionStream(ssc, ehParams)
val cr = stream.window(Seconds(6))
case class Message(msg: String)
stream.map(msg=>Message(new String(msg))).foreachRDD(rdd=>rdd.toDF().registerTempTable("temp"))
stream.print
ssc.start
The above starts and runs fine, but I cannot seem to stop it. Any call to %sql show tables just freezes.
How do I stop the StreamingContext above?
ssc.stop also kills the SparkContext, requiring an interpreter restart. Use ssc.stop(stopSparkContext = false, stopGracefully = true) instead.
ssc.stop in a new paragraph should stop it.
There is also an ongoing discussion on the dev@ mailing list about how to improve integration with streaming platforms.
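For example, in a separate Zeppelin paragraph (a sketch, assuming ssc from the paragraph above is still in scope):
// Run this in its own Zeppelin paragraph when you want to stop the stream.
// stopSparkContext = false keeps the interpreter's SparkContext usable for %sql,
// stopGracefully = true lets in-flight batches finish first.
ssc.stop(stopSparkContext = false, stopGracefully = true)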
