New SQLContext: Spark 1.6 backward-compatibility with Spark 2.1 - apache-spark

On IBM DSX I have the following problem.
For the Spark 1.6 kernels on DSX it was/is necessary to create new SQLContext objects in order to avoid issues with the metastore_db and HiveContext: http://stackoverflow.com/questions/38117849/you-must-build-spark-with-hive-export-spark-hive-true/38118112#38118112
The following code snippets were implemented on Spark 1.6 and both run on Spark 2.0.2, but not on Spark 2.1:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([(1, "a"), (2, "b"), (3, "c"), (4, "d")], ("k", "v"))
df.count()
and
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
properties = {
    'jdbcurl': 'JDBCURL',
    'user': 'USER',
    'password': 'PASSWORD!'
}
data_df_1 = sqlContext.read.jdbc(properties['jdbcurl'], table='GOSALES.BRANCH', properties=properties)
data_df_1.head()
I get this error:
IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
However, when I execute the same code a second time, it works.

Instead of creating a new SQLContext with SQLContext(sc), you can use SQLContext.getOrCreate(sc). This returns the existing SQLContext if one exists, and creates one otherwise.
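For example, a minimal sketch (assuming sc is the SparkContext that the notebook kernel already provides):
from pyspark.sql import SQLContext
# Returns the SQLContext already associated with this SparkContext,
# creating one only if none exists yet.
sqlContext = SQLContext.getOrCreate(sc)
df = sqlContext.createDataFrame([(1, "a"), (2, "b"), (3, "c"), (4, "d")], ("k", "v"))
df.count()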

IIRC, creating a new SQLContext was only necessary for the legacy Spark services (bluemix_ipythonspark_16) in Bluemix. DSX supports only the newer services (bluemix_jupyter_bundle), where creating a new SQLContext is more likely to create problems with Hive than to solve them. Please try it without creating a new SQLContext.
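For instance, the first snippet from the question would then run against whatever context the kernel already provides (a sketch, assuming the notebook predefines sqlContext, as Spark-enabled notebooks typically do):
# Use the kernel-provided sqlContext instead of constructing a new one.
df = sqlContext.createDataFrame([(1, "a"), (2, "b"), (3, "c"), (4, "d")], ("k", "v"))
df.count()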

Related

PySpark Cassandra Database Connection Problem

I am trying to use Cassandra with PySpark. I can make a remote connection to the Spark server properly, but I run into trouble at the stage of reading a Cassandra table. I tried all of the DataStax connectors and changed the Spark configs (cores, memory, etc.), but I couldn't get it to work. (The commented-out lines in the code below are my attempts.)
Here is my Python code:
import os
os.environ['JAVA_HOME']="C:\Program Files\Java\jdk1.8.0_271"
os.environ['HADOOP_HOME']="E:\etc\spark-3.0.1-bin-hadoop2.7"
os.environ['PYSPARK_DRIVER_PYTHON']="/usr/local/bin/python3.7"
os.environ['PYSPARK_PYTHON']="/usr/local/bin/python3.7"
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=XX.XX.XX.XX spark.cassandra.auth.username=username spark.cassandra.auth.password=passwd pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars .ivy2\jars\spark-cassandra-connector-driver_2.12-3.0.0-alpha2.jar pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 pyspark-shell'
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setMaster("spark://YY.YY.YY:7077").setAppName("My app")
conf.set("spark.shuffle.service.enabled", "false")
conf.set("spark.dynamicAllocation.enabled","false")
conf.set("spark.executor.cores", "2")
conf.set("spark.executor.memory", "5g")
conf.set("spark.executor.instances", "1")
conf.set("spark.jars", "C:\\Users\\verianalizi\\.ivy2\\jars\\spark-cassandra-connector_2.12-3.0.0-beta.jar")
conf.set("spark.cassandra.connection.host","XX.XX.XX.XX")
conf.set("spark.cassandra.auth.username","username")
conf.set("spark.cassandra.auth.password","passwd")
conf.set("spark.cassandra.connection.port", "9042")
# conf.set("spark.sql.catalog.myCatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
sc = SparkContext(conf=conf)
# sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
list_p = [('John',19),('Smith',29),('Adam',35),('Henry',50)]
rdd = sc.parallelize(list_p)
ppl = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
DF_ppl = sqlContext.createDataFrame(ppl)
# It works well until now
def load_and_get_table_df(keys_space_name, table_name):
    table_df = sqlContext.read\
        .format("org.apache.spark.sql.cassandra")\
        .option("keyspace", keys_space_name)\
        .option("table", table_name)\
        .load()
    return table_df
movies = load_and_get_table_df("weather", "currentweatherconditions")
The error I get is:
Does anyone have any idea about this?
This happens because you're specifying only the spark.jars property and pointing it to a single jar, but the Spark Cassandra Connector depends on a number of additional jars that aren't included in that list. I recommend instead either using spark.jars.packages with the coordinate com.datastax.spark:spark-cassandra-connector_2.12:3.0.0, or specifying in spark.jars the path to an assembly jar that has all the necessary dependencies.
By the way, 3.0.0 was released several months ago - why are you still using the beta?
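As an illustration of the first option, here is a minimal sketch reusing the master URL, host, and credential placeholders from the question; spark.jars.packages lets Spark resolve the connector and its transitive dependencies from Maven instead of shipping a single jar:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setMaster("spark://YY.YY.YY:7077").setAppName("My app")
# Maven coordinates: Spark downloads the connector plus its transitive dependencies.
conf.set("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.0.0")
conf.set("spark.cassandra.connection.host", "XX.XX.XX.XX")
conf.set("spark.cassandra.auth.username", "username")
conf.set("spark.cassandra.auth.password", "passwd")
conf.set("spark.cassandra.connection.port", "9042")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
# Same read as in the question, now able to find all connector classes.
movies = sqlContext.read\
    .format("org.apache.spark.sql.cassandra")\
    .option("keyspace", "weather")\
    .option("table", "currentweatherconditions")\
    .load()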

How to have more StreamingContexts in a single Spark application?

I'm trying to run a Spark Streaming job in the spark-shell on localhost. Following the code from here, this is what I first tried:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(30))
Which gives the following error:
org.apache.spark.SparkException: Only one SparkContext may be running in
this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true.
And so I had to try this:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new SparkConf().setMaster("local[2]").setAppName
("NetworkWordCount").set("spark.driver.allowMultipleContexts", "true")
val ssc = new StreamingContext(conf, Seconds(30))
This runs but with the following warning:
2018-05-17 17:01:14 WARN SparkContext:87 - Multiple running SparkContexts detected in the same JVM!
So I'd like to know whether there is another way of declaring a StreamingContext object that does not require allowMultipleContexts = true, as it appears that using multiple contexts is discouraged? Thanks
You need to use the existing SparkContext sc to create the StreamingContext
val ssc = new StreamingContext(sc, Seconds(30))
When you create it using the alternate constructor, i.e. the one that takes a SparkConf, it internally creates another SparkContext, and that's why you get that warning.
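For reference, the same pattern in PySpark looks like this (a sketch, assuming the shell-provided SparkContext sc and a 30-second batch interval as in the question):
from pyspark.streaming import StreamingContext
# Reuse the existing SparkContext rather than building a new one from a SparkConf.
ssc = StreamingContext(sc, 30)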

How to write a Spark context in Java in Spark 1.6.3, so as to convert a file into a DataFrame?

I have tried it like this, but no luck. file1 and file2 are on my local machine, not in HDFS. Please help.
SparkConf sparkConf = new SparkConf().setAppName("sample");
SparkContext sc = new SparkContext(sparkConf);
SQLContext sqlContext = SQLContext.getOrCreate(sc);
RDD<String> file1 = sc.textFile("file1.txt", minPartitions);
RDD<String> file2 = sc.textFile("file2.txt", minPartitions);

Getting the HiveContext from a DataFrame

I am creating a HiveContext instead of a SQLContext to create a DataFrame:
val conf=new SparkConf().setMaster("yarn-cluster")
val context=new SparkContext(conf)
//val sqlContext=new SQLContext(context)
val hiveContext=new HiveContext(context)
val data=Seq(1,2,3,4,5,6,7,8,9,10).map(x=>(x.toLong,x+1,x+2.toDouble)).toDF("ts","value","label")
//outdta is a dataframe
data.registerTempTable("df")
//val hiveTest=hiveContext.sql("SELECT * from df where ts < percentile(BIGINT ts, 0.5)")
val ratio1=hiveContext.sql("SELECT percentile_approx(ts, array (0.5,0.7)) from df")
I need to get the exact HiveContext from ratio1 and not create a HiveContext again from the SQLContext provided in the DataFrame. I don't know why Spark doesn't give me a HiveContext from the DataFrame and just gives a SQLContext.
If you use HiveContext, then the runtime type of df.sqlContext is HiveContext (HiveContext is a subtype of SQLContext), therefore you can do:
val hiveContext = df.sqlContext.asInstanceOf[HiveContext]

Why does Spark tell me “name 'sqlContext' is not defined”, and how can I use sqlContext?

I am trying to run a spark-ml example, but
from pyspark import SparkContext
import pyspark.sql
sc = SparkContext(appName="PythonStreamingQueueStream")
training = sqlContext.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
it cannot run because the terminal tells me that
NameError: name 'SQLContext' is not defined
Why did this happen? How can I solve it?
If you are using the Apache Spark 1.x line (i.e. prior to Apache Spark 2.0), then to access the sqlContext you would need to import SQLContext and create it, i.e.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
If you're using Apache Spark 2.0, you can just use the SparkSession directly instead. Therefore your code will be
training = spark.createDataFrame(...)
For more information, please refer to the Spark SQL Programming Guide.
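For example, here is a complete minimal sketch of the Spark 2.0 approach using the data from the question (assuming Vectors is meant to come from pyspark.ml.linalg, since the question mentions spark-ml):
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
# SparkSession replaces SQLContext as the entry point in Spark 2.0+.
spark = SparkSession.builder.appName("PythonStreamingQueueStream").getOrCreate()
training = spark.createDataFrame([
    (1.0, Vectors.dense([0.0, 1.1, 0.1])),
    (0.0, Vectors.dense([2.0, 1.0, -1.0])),
    (0.0, Vectors.dense([2.0, 1.3, 1.0])),
    (1.0, Vectors.dense([0.0, 1.2, -0.5]))], ["label", "features"])
training.show()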
from pyspark.sql import SparkSession,SQLContext
spark = SparkSession.builder.appName("Basics").getOrCreate()
sc=spark.sparkContext
sqlContext = SQLContext(sc)
df = sqlContext.range(0,10)
The above piece of code will solve your issue.
