Error while instantiating 'org.apache.spark.sql.hive.HiveACLSessionStateBuilder' - apache-spark

The problem is that the pyspark script runs fine on one cluster, but the error occurs when I run the same script on another YARN cluster. I guess the Spark environment configurations differ between the two clusters.
Here is the code used to initialize the Spark session.
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import *
from pyspark.sql.session import SparkSession

sc = SparkContext()
spark = SparkSession(sc)
hive_context = HiveContext(sc)
The error:
Error while instantiating 'org.apache.spark.sql.hive.HiveACLSessionStateBuilder'

Related

spark variable in pyspark vs SparkSession

When we start pyspark (Spark 2.4), it comes with a spark variable that exposes the underlying functionality.
So when should we call and use SparkSession and SparkContext methods if "spark" is already available?
Using Spark 2.4, you probably see something like this in your logs:
Spark context available as 'sc' (master = yarn, app id = application...).
Spark session available as 'spark'.
According to the Databricks blog:
In previous versions of Spark, you had to create a SparkConf and SparkContext to interact with Spark, as shown here:
//set up the spark configuration and create contexts
val sparkConf = new SparkConf()
  .setAppName("SparkSessionZipsExample")
  .setMaster("local")
  .set("spark.some.config.option", "some-value")
// your handle to SparkContext to access other contexts like SQLContext
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
Whereas in Spark 2.0 the same effects can be achieved through SparkSession, without explicitly creating SparkConf, SparkContext or SQLContext, as they're encapsulated within the SparkSession.
So:
In your case, spark is just an alias for the SparkSession.
You do not need to create a SparkContext yourself, as it is encapsulated within the SparkSession.
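In the pyspark shell this looks roughly like the sketch below (the DataFrame contents are made up for illustration):
# 'spark' is the SparkSession the shell created; its SparkContext is reachable from it.
sc = spark.sparkContext
print(sc.master, sc.applicationId)
# Typical Spark 2.x usage goes through the session directly:
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()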

Error when creating sqlContext in Apache Spark

I am using Apache Spark and running it in an IPython notebook.
I am trying to convert a regular dataframe to a Spark DataFrame. For that I need sqlContext. When I use it, I get an error.
Error says:
IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':"
I have looked up multiple resources but am not able to solve this issue.
SQLContext used to be the entry point for the SQL functionality in Spark 1.x; in Spark 2 it has been replaced with SparkSession (documentation). So, here is the proper way to initialize Spark in version 2.2, which is the one you are using according to your screenshot:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
sc = SparkContext(conf=conf)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
See the documentation on Spark SQL for further usage examples.
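For the original goal of converting a regular dataframe to a Spark DataFrame, a minimal sketch, assuming the regular dataframe is a pandas one with made-up columns:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical pandas DataFrame standing in for the "regular dataframe" in the question.
pdf = pd.DataFrame({"name": ["alice", "bob"], "age": [34, 45]})
# In Spark 2.x the session converts it directly; no separate sqlContext is needed.
sdf = spark.createDataFrame(pdf)
sdf.show()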

Specifying custom profilers for pyspark running Spark 2.0

I would like to know how to specify a custom profiler class in PySpark for Spark version 2+. Under 1.6, I know I can do so like this:
sc = SparkContext('local', 'test', profiler_cls=MyProfiler)
but when I create the SparkSession in 2.0 I don't explicitly have access to
the SparkContext. Can someone please advise how to do this for Spark 2.0+?
SparkSession can be initialized with an existing SparkContext, for example:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.profiler import BasicProfiler
spark = SparkSession(SparkContext('local', 'test', profiler_cls=BasicProfiler))
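A slightly fuller sketch of the same idea, assuming a hypothetical custom profiler that subclasses BasicProfiler, with spark.python.profile enabled so profiles are actually collected:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.profiler import BasicProfiler

class MyCustomProfiler(BasicProfiler):
    """Hypothetical profiler that just labels the default report."""
    def show(self, id):
        print("Custom profile report for RDD %s" % id)
        BasicProfiler.show(self, id)

conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext('local', 'test', conf=conf, profiler_cls=MyCustomProfiler)
spark = SparkSession(sc)

# Run any Python-side job so there is something to profile.
spark.sparkContext.parallelize(range(1000)).map(lambda x: x * x).count()
sc.show_profiles()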

How to enable streaming from Cassandra to Spark?

I have the following spark job:
from __future__ import print_function
import os
import sys
import time
from random import random
from operator import add
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark_cassandra import streaming, CassandraSparkContext

if __name__ == "__main__":
    conf = SparkConf().setAppName("PySpark Cassandra Test")
    sc = CassandraSparkContext(conf=conf)
    stream = StreamingContext(sc, 2)
    rdd = sc.cassandraTable("keyspace2", "users").collect()
    # print(rdd)
    stream.start()
    stream.awaitTermination()
    sc.stop()
When I run this, it gives me the following error:
ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
The shell script I run:
./bin/spark-submit --packages TargetHolding:pyspark-cassandra:0.2.4 examples/src/main/python/test/reading-cassandra.py
Comparing this with Spark Streaming + Kafka, the code above is missing a line like:
kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", {'topic':1})
where createStream actually sets up the input stream, but for Cassandra I can't see anything like this in the docs. How do I start streaming between Spark Streaming and Cassandra?
Versions:
Cassandra v2.1.12
Spark v1.4.1
Scala 2.10
To create a DStream out of a Cassandra table, you can use a ConstantInputDStream, providing the RDD created from the Cassandra table as input. This results in the RDD being materialized on each DStream interval.
Be warned that large tables, or tables that continuously grow in size, will negatively impact the performance of your streaming job.
See also: Reading from Cassandra using Spark Streaming for an example.
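ConstantInputDStream itself is only exposed in the Scala/Java API. As a rough PySpark analogue, here is a sketch (an assumption, not the connector's official streaming API) that uses queueStream with a default RDD to re-emit the Cassandra RDD each batch, and registers an output operation so the context has something to execute:
from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark_cassandra import CassandraSparkContext

conf = SparkConf().setAppName("PySpark Cassandra Test")
sc = CassandraSparkContext(conf=conf)
ssc = StreamingContext(sc, 2)

# RDD over the Cassandra table from the question.
users_rdd = sc.cassandraTable("keyspace2", "users")

# Re-emit the same RDD on every 2-second batch (ConstantInputDStream-style).
stream = ssc.queueStream([], default=users_rdd)

def print_count(rdd):
    print(rdd.count())

# An output operation is required, otherwise: "No output operations registered".
stream.foreachRDD(print_count)

ssc.start()
ssc.awaitTermination()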

Use collect_list and collect_set in Spark SQL

According to the docs, the collect_set and collect_list functions should be available in Spark SQL. However, I cannot get them to work. I'm running Spark 1.6.0 using a Docker image.
I'm trying to do this in Scala:
import org.apache.spark.sql.functions._
df.groupBy("column1")
  .agg(collect_set("column2"))
  .show()
And receive the following error at runtime:
Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined function collect_set;
I also tried it using pyspark, but it also fails. The docs state these functions are aliases of Hive UDAFs, but I can't figure out how to enable them.
How can I fix this? Thanks!
Spark 2.0+:
SPARK-10605 introduced a native collect_list and collect_set implementation. SparkSession with Hive support or HiveContext is no longer required.
Spark 2.0-SNAPSHOT (before 2016-05-03):
You have to enable Hive support for a given SparkSession:
In Scala:
val spark = SparkSession.builder
.master("local")
.appName("testing")
.enableHiveSupport() // <- enable Hive support.
.getOrCreate()
In Python:
spark = (SparkSession.builder
.enableHiveSupport()
.getOrCreate())
Spark < 2.0:
To be able to use Hive UDFs (see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) you have to use Spark built with Hive support (this is already covered when you use pre-built binaries, which seems to be the case here) and initialize the SQLContext as a HiveContext.
In Scala:
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SQLContext
val sqlContext: SQLContext = new HiveContext(sc)
In Python:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
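Once that is in place (or on Spark 2.x, where the native implementations are available), a minimal end-to-end sketch with a made-up DataFrame matching the column names in the question:
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list, collect_set

spark = SparkSession.builder.getOrCreate()

# Hypothetical data for illustration.
df = spark.createDataFrame(
    [("a", 1), ("a", 1), ("a", 2), ("b", 3)],
    ["column1", "column2"],
)

df.groupBy("column1").agg(
    collect_set("column2").alias("unique_values"),  # duplicates dropped
    collect_list("column2").alias("all_values"),    # duplicates kept
).show()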
