SparkR logging method - apache-spark

What is the logging method for SparkR, i.e. the equivalent of the sparklyr spark_log(sc, -1) method?
E.g. get the sparklyr and SparkR contexts:
sc <- get_spark_connection()
sparkR.session()
# set the logging definition in sparklyr and SparkR
setup_logging(spark = "INFO", megdp = "INFO")
setLogLevel("ERROR")
How can I get the logging object in SparkR, please?
Below is the method in sparklyr:
`logs <- spark_log(sc, -1)`

Related

Getting exception "No output operations registered, so nothing to execute" from Spark Streaming

package com.scala.sparkStreaming

import org.apache.spark._
import org.apache.spark.streaming._

object Demo1 {
  def main(assdf: Array[String]) {
    val sc = new SparkContext("local", "Stream")
    val stream = new StreamingContext(sc, Seconds(2))
    val rdd1 = stream.textFileStream("D:/My Documents/Desktop/inbound/sse/ssd/").cache()
    val mp1 = rdd1.flatMap(_.split(","))
    print(mp1.count())
    stream.start()
    stream.awaitTermination()
  }
}
When I run it, it throws an exception:
org.apache.spark.streaming.dstream.MappedDStream#63429932
20/05/22 18:14:16 ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:277)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:169)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:517)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:577)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:576)
at com.scala.sparkStreaming.Demo1$.main(Demo1.scala:18)
at com.scala.sparkStreaming.Demo1.main(Demo1.scala)
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:277)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:169)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:517)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:577)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:576)
at com.scala.sparkStreaming.Demo1$.main(Demo1.scala:18)
at com.scala.sparkStreaming.Demo1.main(Demo1.scala)
The error message "No output operations registered, so nothing to execute" gives a hint that something is missing.
Your DStreams rdd1 and mp1 do not have any output operation (action). A flatMap is only a transformation, which is evaluated lazily by Spark. That is why the stream.start() method throws this exception.
According to the documentation, you can print out an RDD as shown below. Since you are dealing with a DStream, you can iterate over its RDDs. The code below runs fine with Spark version 2.4.5.
The documentation of textFileStream says that it "monitors a Hadoop-compatible filesystem for new files and reads them as text files", so make sure that you add/modify the file you want to read while the job is running.
Also, although I am not completely familiar with Spark on Windows, you may need to change the directory string to
file://D:\\My Documents\\Desktop\\inbound\\sse\\ssd
Here is the full code example for Spark Streaming:
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
object Main extends App {
  val sc = new SparkContext("local[1]", "Stream")
  val stream = new StreamingContext(sc, Seconds(2))

  val rdd1 = stream.textFileStream("file:///path/to/src/main/resources")
  val mp1 = rdd1.flatMap(_.split(" "))
  mp1.foreachRDD(rdd => rdd.collect().foreach(println(_)))

  stream.start()
  stream.awaitTermination()
}
As of Spark version 2.4.5, Spark Streaming (the DStream API) is effectively superseded by Spark Structured Streaming, and I would suggest getting familiar with the latter. The code for that would look something like this:
// Structured Streaming (assumes an active SparkSession named `spark`)
import org.apache.spark.sql.DataFrame

val lines: DataFrame = spark.readStream
  .format("text")
  .option("path", "file:///path/to/src/main/resources")
  .option("maxFilesPerTrigger", "1")
  .load()

val query = lines.writeStream
  .outputMode("append")
  .format("console")
  .start()

query.awaitTermination()

Why am I able to map a RDD with the SparkContext

The SparkContext is not serializable. It is meant to be used on the driver, so can someone explain the following?
Using the spark-shell, on yarn, and Spark version 1.6.0
val rdd = sc.parallelize(Seq(1))
rdd.foreach(x => print(sc))
Nothing happens on the client (the print happens on the executor side)
Using the spark-shell, local master, and Spark version 1.6.0
val rdd = sc.parallelize(Seq(1))
rdd.foreach(x => print(sc))
Prints "null" on the client
Using pyspark, local master, and Spark version 1.6.0
rdd = sc.parallelize([1])
def _print(x):
    print(x)

rdd.foreach(lambda x: _print(sc))
Throws an Exception
I also tried the following:
Using the spark-shell, and Spark version 1.6.0
class Test(val sc:org.apache.spark.SparkContext) extends Serializable{}
val test = new Test(sc)
rdd.foreach(x => print(test))
Now it finally throws a java.io.NotSerializableException: org.apache.spark.SparkContext
Why does it work in Scala when I only print sc? Why do I get a null reference when it should have thrown a NotSerializableException (or so I thought...)?
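A hedged note on why this likely happens: in the spark-shell, sc is declared as a @transient val inside the REPL's wrapper object, so when a closure referencing it is serialized, the field is simply skipped and comes back as null on the deserializing side instead of triggering a NotSerializableException. On YARN you see nothing on the client simply because the print goes to the executor's stdout; in local mode the executor shares the driver's console, so you see the "null". Below is a minimal, Spark-free sketch of that serialization behaviour; the Holder class and field name are made up for illustration:

import java.io._

// A serializable holder whose field is @transient, mimicking how the
// spark-shell declares `@transient val sc`. Class and field names are made up.
class Holder(@transient val context: AnyRef) extends Serializable

object TransientDemo extends App {
  val holder = new Holder(new Object) // the field is populated on the "driver" side

  // Round-trip through Java serialization, as Spark does when shipping a closure.
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(holder)
  out.close()

  val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
  val copy = in.readObject().asInstanceOf[Holder]

  // Prints "null": the @transient field was skipped during serialization,
  // so no NotSerializableException is ever thrown for it.
  println(copy.context)
}

By contrast, in the Test(sc) experiment sc becomes an ordinary (non-transient) field, so serialization really does try to write the SparkContext and fails with NotSerializableException, which matches what you observed.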

WriteConf of Spark-Cassandra Connector being used or not

I am using Spark version 1.6.2, Spark-Cassandra Connector 1.6.0, Cassandra-Driver-Core 3.0.3
I am writing a simple Spark job in which I am trying to insert some rows to a table in Cassandra. The code snippet used was:
import scala.collection.mutable.ListBuffer

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext

val sparkConf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "<Cassandra IP>")
  .set("spark.cassandra.auth.username", "test")
  .set("spark.cassandra.auth.password", "test")
  .set("spark.cassandra.output.batch.size.rows", "1")

val sc = new SparkContext(sparkConf)
val cassandraSQLContext = new CassandraSQLContext(sc)
cassandraSQLContext.setKeyspace("test")

val query = "select * from test"
val dataRDD = cassandraSQLContext.cassandraSql(query).rdd

val addRowList = ListBuffer(
  Test(111, 10, 100000, "{'test':'0','test1':'1','others':'2'}"),
  Test(111, 20, 200000, "{'test':'0','test1':'1','others':'2'}")
)

val insertRowRDD = sc.parallelize(addRowList)
insertRowRDD.saveToCassandra("test", "test")
Test() is a case class
Now, I have passed the WriteConf parameter output.batch.size.rows when building the SparkConf object. I expect this code to write one row per batch to Cassandra, but I cannot find a way to cross-verify that the batch-write configuration actually used is the one passed in the code snippet rather than the default.
I could not find anything in Cassandra's cassandra.log, system.log, and debug.log.
So can anyone suggest a way of cross-verifying the WriteConf used by the Spark-Cassandra Connector to write batches to Cassandra?
There are two things you can do to verify that your setting was applied.
First, you can call the method which creates the WriteConf:
WriteConf.fromSparkConf(sparkConf)
The resulting object can be inspected to make sure all the values are what you want. This is the default argument to saveToCassandra.
Second, you can explicitly pass a WriteConf to the save method:
saveAsCassandraTable(keyspace, table, writeConf = WriteConf(...))
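To make both checks concrete, here is a minimal sketch assuming spark-cassandra-connector 1.6.x; the shape of the Test case class is guessed (the question does not show its definition), and the keyspace/table names are the placeholders from the question:

import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.{RowsInBatch, WriteConf}
import org.apache.spark.{SparkConf, SparkContext}

// Guessed shape of the question's Test case class.
case class Test(a: Int, b: Int, c: Long, d: String)

object WriteConfCheck extends App {
  val sparkConf = new SparkConf(true)
    .setMaster("local[2]") // or let spark-submit supply the master
    .setAppName("writeconf-check")
    .set("spark.cassandra.connection.host", "<Cassandra IP>")
    .set("spark.cassandra.output.batch.size.rows", "1")
  val sc = new SparkContext(sparkConf)

  // 1) Build the WriteConf the connector derives from the SparkConf and inspect it.
  val derived = WriteConf.fromSparkConf(sparkConf)
  println(derived.batchSize) // expect RowsInBatch(1) if the setting was picked up

  // 2) Or bypass the SparkConf and pass an explicit WriteConf to the save call.
  val rows = sc.parallelize(Seq(Test(111, 10, 100000L, "{}")))
  rows.saveToCassandra("test", "test", writeConf = WriteConf(batchSize = RowsInBatch(1)))
}

Printing the derived WriteConf on the driver is the simplest cross-check, since it shows the exact settings the connector will use for the write.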

Convert DStreams to RDDs

How can I convert a Spark Streaming DStream to RDDs so that they can be used with the SparkContext rather than inside the StreamingContext? I am using Python.
Here is the error I am getting:
AttributeError: 'TransformedDStream' object has no attribute 'foreach'
def result(y):
    return y

d_stream = input_stream.map(lambda x: mainReduce(x)).map(lambda q: q)
rdd = d_stream.foreach(result)
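A DStream has no foreach; the usual ways to get at plain RDDs are foreachRDD (to consume each micro-batch) and transform (to build a new DStream from RDD-level operations), and PySpark's DStream exposes the same two methods. Below is a minimal sketch in Scala; the socket source and the word-count body are stand-ins, since the question's input_stream and mainReduce are not shown:

import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamToRdd extends App {
  val sc = new SparkContext("local[2]", "dstream-to-rdd")
  val ssc = new StreamingContext(sc, Seconds(2))

  // Placeholder source standing in for the question's input_stream.
  val inputStream = ssc.socketTextStream("localhost", 9999)

  // foreachRDD hands you a plain RDD for every micro-batch; inside the block
  // you are back in ordinary SparkContext land and can use any RDD operation.
  inputStream.foreachRDD { rdd =>
    val counts = rdd.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)
  }

  ssc.start()
  ssc.awaitTermination()
}

transform works the same way but returns a new DStream, which is the right choice when the result should stay in the streaming pipeline.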

Error in callJMethod(sqlContext, "parquetFile", paths) : Invalid jobj 1. If SparkR was restarted, Spark operations need to be re-executed

I want to run SparkR on yarn-client through the SparkR shell, so I do the following:
./sparkR
sparkR.stop();
sc <- sparkR.init(master="yarn-client", appName="SparkR-Parquet-example2", sparkHome = Sys.getenv("SPARK_HOME"), sparkExecutorEnv = list(HADOOP_CONF_DIR="/etc/hadoop/conf.cloudera.yarn", YARN_CONF_DIR="/etc/hadoop/conf.cloudera.yarn"))
sqlContext <- sparkRSQL.init(sc)
path<-"hdfs://year=2015/month=1/day=9"
AppDF <- parquetFile(sqlContext, path)
Error in callJMethod(sqlContext, "parquetFile", paths) :
Invalid jobj 1. If SparkR was restarted, Spark operations need to be re-executed.
I'm new to Spark; could anyone help me solve this?
I'm using spark-1.4.0-bin-hadoop2.6.

Resources