Error initializing SparkContext when using SPARK-SHELL in spark standalone - apache-spark

I have installed Scala.
I have installed Java 8.
All environment variables have been set for Spark, Java, and Hadoop.
I am still getting this error while running the spark-shell command. Can someone please help? I have googled a lot but didn't find anything.
(screenshot: spark-shell error)
(screenshot: spark-shell error 2)

Spark's shell provides a simple way to learn the API. Start the shell by running the following in the Spark directory:
./bin/spark-shell
Then run the Scala snippet below:
import org.apache.spark.sql.SparkSession
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
If the error still persists, then we have to look into the environment setup.
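For example, on Linux or macOS you can first confirm that the variables the question mentions are actually visible to the shell (the paths in the comments are illustrative, not your actual install locations):

echo $JAVA_HOME    # e.g. /usr/lib/jvm/java-8-openjdk
echo $SPARK_HOME   # e.g. /opt/spark
echo $HADOOP_HOME  # e.g. /opt/hadoop
$JAVA_HOME/bin/java -version    # should report a Java 8 runtime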

Related

exception: java.lang.IllegalArgumentException: Wrong FS: abfs://container@storageaccount.dfs.core.windows.net/folder, expected: hdfs://master-node:port

I am trying to execute a Scala file using spark-submit on CDP CDH 7.2.9 hosted on the Azure platform, but I am getting the error below:
User class threw exception: java.lang.IllegalArgumentException: Wrong FS: abfs://container@storageaccount.dfs.core.windows.net/folder, expected: hdfs://master-node:port at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:773)
I have tried the options below, as suggested on Stack Overflow, but no luck:
a. Setting spark.hadoop.fs.defaultFS = "abfs://container@storageaccount.dfs.core.windows.net" in the Spark conf.
b. val hdfs = FileSystem.get(new java.net.URI("abfs://container@storageaccount.dfs.core.windows.net"), spark.sparkContext.hadoopConfiguration) in the Scala program.
c. val adlsURI = "abfs://container@storageaccount.dfs.core.windows.net"
FileSystem.setDefaultUri(spark.sparkContext.hadoopConfiguration, new java.net.URI(adlsURI))
val hdfs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
in the Scala program.
d. Setting spark.yarn.access.hadoopFileSystems = abfs://container@storageaccount.dfs.core.windows.net in the Spark conf.
I am using Spark version 2.4.
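One pattern worth checking (a sketch, not a confirmed fix for this CDP setup; the container and account names are the question's placeholders): derive the FileSystem from the Path itself rather than from fs.defaultFS, so the checkPath comparison always sees matching schemes:

import org.apache.hadoop.fs.{FileSystem, Path}

// Fully qualified ABFS path (placeholders from the question)
val abfsPath = new Path("abfs://container@storageaccount.dfs.core.windows.net/folder")
// Asking the Path for its own FileSystem avoids comparing an abfs:// path
// against the cluster default hdfs://master-node:port
val fs: FileSystem = abfsPath.getFileSystem(spark.sparkContext.hadoopConfiguration)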

Difference between spark_session and sqlContext on loading a local file

I tried to load a local file as a dataframe using spark_session and sqlContext.
df = spark_session.read...load(localpath)
It couldn't read local files; df was empty.
But after creating a sqlContext from spark_context, it could load a local file.
sqlContext = SQLContext(spark_context)
df = sqlContext.read...load(localpath)
It worked fine, but I can't understand why. What is the cause?
Environment: Windows 10, Spark 2.2.1
EDIT
Finally I resolved this problem. The root cause was a version difference between the PySpark installed with pip and the PySpark installed on the local file system. PySpark failed to start because py4j failed.
I am pasting sample code that might help. We have used this to create a SparkSession object and read a local file with it:
import org.apache.spark.sql.SparkSession

object SetTopBox_KPI1_1 {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("SetTopBox Data Analysis <Input-File> OR <Output-File> is missing")
      System.exit(1)
    }
    val spark = SparkSession.builder().appName("KPI1_1").getOrCreate()
    val record = spark.read.textFile(args(0)).rdd
    .....
On the whole, in Spark 2.2 the preferred way to use Spark is by creating a SparkSession object.
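For completeness, a minimal sketch of the two entry points side by side (the file path is a placeholder, and the file:/// prefix makes the local scheme explicit; in Spark 2.x the old SQLContext is still reachable from the session):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LocalFileRead").master("local[*]").getOrCreate()
// Both readers go through the same underlying session, so they should behave
// identically once the pip/local PySpark version mismatch is fixed
val viaSession = spark.read.textFile("file:///path/to/localfile.txt")
val viaSqlContext = spark.sqlContext.read.textFile("file:///path/to/localfile.txt")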

Why am I able to map a RDD with the SparkContext

The SparkContext is not serializable. It is meant to be used on the driver, so can someone explain the following?
Using the spark-shell, on YARN, with Spark version 1.6.0:
val rdd = sc.parallelize(Seq(1))
rdd.foreach(x => print(sc))
Nothing happens on the client (it prints on the executors' side).
Using the spark-shell, local master, with Spark version 1.6.0:
val rdd = sc.parallelize(Seq(1))
rdd.foreach(x => print(sc))
Prints "null" on the client
Using pyspark, local master, with Spark version 1.6.0:
rdd = sc.parallelize([1])

def _print(x):
    print(x)

rdd.foreach(lambda x: _print(sc))
Throws an exception.
I also tried the following:
Using the spark-shell, with Spark version 1.6.0:
class Test(val sc:org.apache.spark.SparkContext) extends Serializable{}
val test = new Test(sc)
rdd.foreach(x => print(test))
Now it finally throws a java.io.NotSerializableException: org.apache.spark.SparkContext
Why does it work in Scala when I only print sc? Why do I have a null reference when it should have thrown a NotSerializableException (or so I thought...)?

Getting parameters of Spark submit while running a Spark job

I am running a Spark job via spark-submit and using its --files parameter to load a log4j.properties file.
In my Spark job I need to get this parameter:
import org.apache.spark.{SparkConf, SparkContext}

object LoggerSparkUsage {
  def main(args: Array[String]): Unit = {
    //DriverHolder.log.info("unspark")
    println("args are....." + args.mkString(" "))
    val conf = new SparkConf().setAppName("Simple_Application") //.setMaster("local[4]")
    val sc = new SparkContext(conf)
    // conf.getExecutorEnv.
    val count = sc.parallelize(Array(1, 2, 3)).count()
    println("these are files" + conf.get("files"))
    // LoggerDriver is the poster's custom logging helper
    LoggerDriver.log.info("log1 for info..")
    LoggerDriver.log.info("log2 for info..")
    f2
  }

  def f2 { LoggerDriver.log.info("logs from another function..") }
}
My spark-submit command is something like this:
/opt/mapr/spark/spark-1.6.1/bin/spark-submit --class "LoggerSparkUsage" --master yarn-client --files src/main/resources/log4j.properties /mapr/cellos-mapr/user/mbazarganigilani/SprkHbase/target/scala-2.10/sprkhbase_2.10-1.0.2.jar
I tried to get the properties using
conf.get("files")
but it gives me an exception.
Can anyone give me a solution for this?
The correct key for the files is spark.files:
scala.util.Try(sc.getConf.get("spark.files"))
but to get the actual path on the workers you have to use SparkFiles:
org.apache.spark.SparkFiles.get(fileName)
If that is not sufficient, you can pass the paths as application arguments and retrieve them from the main args, or use a custom key in spark.conf.
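Putting the two together, a minimal sketch (assuming the job was submitted with --files .../log4j.properties as in the command above):

import scala.util.Try
import org.apache.spark.SparkFiles

// Comma-separated list of URIs that were shipped via --files
val filesConf = Try(sc.getConf.get("spark.files")).getOrElse("")
println("spark.files = " + filesConf)
// Local path of the distributed copy; valid on the driver and the executors
val log4jPath = SparkFiles.get("log4j.properties")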

Parquet file in Spark SQL

I am trying to use Spark SQL with the parquet file format. When I try the basic example:
import org.apache.spark.{SparkConf, SparkContext}

object parquet {
  case class Person(name: String, age: Int)

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setMaster("local").setAppName("HdfsWordCount")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
    import sqlContext.createSchemaRDD
    val people = sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
    val parquetFile = sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
  }
}
I get a null pointer exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.parquet$.main(parquet.scala:16)
which points to the saveAsParquetFile line. What's the issue here?
This error occurred when I was using Spark in Eclipse on Windows. I tried the same code in spark-shell and it worked fine. I guess Spark might not be 100% compatible with Windows.
Spark is compatible with Windows. You can run your program in a spark-shell session on Windows, or you can run it using spark-submit with the necessary arguments, such as "--master" (again, on Windows or another OS).
You cannot just run your Spark program as an ordinary Java program in Eclipse without properly setting up the Spark environment and so on. Your problem has nothing to do with Windows.
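For example, a sketch of such a spark-submit invocation (the JAR path is a hypothetical placeholder; the class name follows the stack trace above, which shows the object lives in package org.apache.spark):

./bin/spark-submit --class org.apache.spark.parquet --master local target/scala-2.10/your-app.jar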
