Parquet file in Spark SQL - apache-spark

I am trying to use Spark SQL using parquet file formats. When I try the basic example :
object parquet {
case class Person(name: String, age: Int)
def main(args: Array[String]) {
val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD
val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
}
}
I get a null pointer exception :
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.parquet$.main(parquet.scala:16)
which is the line saveAsParquetFile. What's the issue here?

This error occurs when I was using Spark in eclipse in Windows. I tried the same on spark-shell and it works fine. I guess spark might not be 100% compatible with windows.

Spark is compatible with Windows. You can run your program in a spark-shell session in Windows or you can run it using spark-submit with necessary argument such as "-master" (again, in Windows or other OS).
You cannot just run your Spark program as an ordinary Java program in Eclispe without properly setting up the Spark environment and so on. You problem has nothing to do with Windows.

Related

Getting exception "No output operations registered, so nothing to execute" from Spark Streaming

package com.scala.sparkStreaming
import org.apache.spark._
import org.apache.spark.streaming._
object Demo1 {
def main(assdf:Array[String]){
val sc=new SparkContext("local","Stream")
val stream=new StreamingContext(sc,Seconds(2))
val rdd1=stream.textFileStream("D:/My Documents/Desktop/inbound/sse/ssd/").cache()
val mp1= rdd1.flatMap(_.split(","))
print(mp1.count())
stream.start()
stream.awaitTermination()
}
}
I had run it, then it shows an exception
org.apache.spark.streaming.dstream.MappedDStream#6342993220/05/22 18:14:16 ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:277)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:169)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:517)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:577)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:576)
at com.scala.sparkStreaming.Demo1$.main(Demo1.scala:18)
at com.scala.sparkStreaming.Demo1.main(Demo1.scala)
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:277)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:169)
at org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:517)
at org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:577)
at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:576)
at com.scala.sparkStreaming.Demo1$.main(Demo1.scala:18)
at com.scala.sparkStreaming.Demo1.main(Demo1.scala)
The error message "No output operations registered, so nothing to execute" gives a hint that something is missing.
Your Direct Streams rdd1 and mp1 do not have any Action. A flatMap is only a Transformation which gets lazily evaluated by Spark. That is why the stream.start() method throws this Exception.
According to the documentation, you can print out an RDD as shown below. As you are dealing with a DStream you could iterator through the RDDs. Below code runs fine with Spark version 2.4.5.
The documentation of textFileStream says that it "monitors a Hadoop-compatible filesystem for new files and reads them as text files", so make sure that you add/modify the file you want to read while the job is running.
Also, although I am not completely familiar with Spark on Windows, you may need to change the directory string to
file://D:\\My Documents\\Desktop\\inbound\\sse\\ssd
Here is the full code example for Spark Streaming:
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
object Main extends App {
val sc=new SparkContext("local[1]","Stream")
val stream=new StreamingContext(sc,Seconds(2))
val rdd1 =stream.textFileStream("file:///path/to/src/main/resources")
val mp1= rdd1.flatMap(_.split(" "))
mp1.foreachRDD(rdd => rdd.collect().foreach(println(_)))
stream.start()
stream.awaitTermination()
}
In Spark version 2.4.5 Spark Streaming is deprecated and I would suggest to get familiar already with Spark Structured Streaming. The code for that would look something like this:
// Structured Streaming
val lines: DataFrame = spark.readStream
.format("text")
.option("path", "file://path/to/src/main/resources")
.option("maxFilesPerTrigger", "1")
.load()
val query = lines.writeStream
.outputMode("append")
.format("console")
.start()
query.awaitTermination()

Difference between spark_session and sqlContext on loading a local file

I'm tried to load a local file as dataframe with using spark_session and sqlContext.
df = spark_session.read...load(localpath)
It couldn't read local files. df is empty.
But, after creating sqlcontext from spark_context, it could load a local file.
sqlContext = SQLContext(spark_context)
df = sqlContext.read...load(localpath)
It worked fine. But I can't understand why. What is the cause ?
Envionment: Windows10, spark 2.2.1
EDIT
Finally I've resolved this problem. The root cause is version difference between PySpark installed with pip and PySpark installed in local file system. PySpark failed to start because of py4j failing.
I am pasting a sample code that might help. We have used this to create a Sparksession object and read a local file with it:
import org.apache.spark.sql.SparkSession
object SetTopBox_KPI1_1 {
def main(args: Array[String]): Unit = {
if(args.length < 2) {
System.err.println("SetTopBox Data Analysis <Input-File> OR <Output-File> is missing")
System.exit(1)
}
val spark = SparkSession.builder().appName("KPI1_1").getOrCreate()
val record = spark.read.textFile(args(0)).rdd
.....
On the whole, in Spark 2.2 the preferred way to use Spark is by creating a SparkSession object.

Error initializing SparkContext when using SPARK-SHELL in spark standalone

I have Installed Scala.
I have installed java 8.
Also all environment variables has been set for spark,java and Hadoop.
Still getting this error while running spark-shell command. Please someone help....google it a lot but didn't find anything.
spark-shell error
spark shell error2
Spark’s shell provides a simple way to learn the API, Start shell by running the following in the Spark directory:
./bin/spark-shell
Then run below scala code snippet:
import org.apache.spark.sql.SparkSession
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
If stills error persist,then we have to look into environment set up

I'm trying to read the data from the files in the directory as soon as a new file is created. Real time "File Streaming"

I'm currently learning spark-streaming. I'm trying to read the data from the files in the directory as soon as a new file is created. Real time "File Streaming". I'm getting the below error. Can anyone suggest me a solution?
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
object FileStreaming {
def main(args:Array[String]): Unit = {
val conf = new SparkConf().setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.textFileStream("C:\\Users\\PRAGI V\\Desktop\\data-
master\\data-master\\cards")
lines.flatMap(x => x.split(" ")).map(x => (x, 1)).print()
ssc.start()
ssc.awaitTermination()
}
}
Error:
Exception in thread "main" org.apache.spark.SparkException: An application
name must be set in your configuration
at org.apache.spark.SparkContext. <init>(SparkContext.scala:170)
at org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:555)
at org.apache.spark.streaming.StreamingContext.<init>
(StreamingContext.scala:75)
at FileStreaming$.main(FileStreaming.scala:15)
at FileStreaming.main(FileStreaming.scala)
The error message is very clear, you need to set app name in spark conf objet.
Replace
val conf = new SparkConf().setMaster("local[2]”)
to
val conf = new SparkConf().setMaster("local[2]”).setAppName(“MyApp")
Would suggest to read official Spark Programming Guide
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
The appName parameter is a name for your application to show on the
cluster UI. master is a Spark, Mesos or YARN cluster URL, or a special
“local” string to run in local mode.
Online documentation have lot of examples to get started.
Cheers !

Why am I able to map a RDD with the SparkContext

The SparkContext is not serializable. It is meant to be used on the driver, thus can someone explain the following ?
Using the spark-shell, on yarn, and Spark version 1.6.0
val rdd = sc.parallelize(Seq(1))
rdd.foreach(x => print(sc))
Nothing happens on the client (prints executors-side)
Using the spark-shell, local master, and Spark version 1.6.0
val rdd = sc.parallelize(Seq(1))
rdd.foreach(x => print(sc))
Prints "null" on the client
Using pyspark, local master, and Spark version 1.6.0
rdd = sc.parallelize([1])
def _print(x):
print(x)
rdd.foreach(lambda x: _print(sc))
Throws an Exception
I also tried the following :
Using the spark-shell, and Spark version 1.6.0
class Test(val sc:org.apache.spark.SparkContext) extends Serializable{}
val test = new Test(sc)
rdd.foreach(x => print(test))
Now it finally throws a java.io.NotSerializableException: org.apache.spark.SparkContext
Why does it works in Scala when I only print sc ? Why do I have a null reference when it should have thrown a NotSerializableException (or so I thought ...)

Resources