NameError: name 'SparkSession' is not defined - apache-spark

I'm new to Cask CDAP and the Hadoop environment.
I'm creating a pipeline and I want to use a PySpark program. I have the complete script for the Spark program, and it works when I test it from the command line, but it doesn't work when I copy and paste it into a CDAP pipeline.
It gives me this error in the logs:
NameError: name 'SparkSession' is not defined
My script starts like this:
from pyspark.sql import *
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import trim, to_date, year, month
sc= SparkContext()
How can I fix it?

Spark connects to the locally running Spark cluster through a SparkContext. A better explanation can be found here: https://stackoverflow.com/a/24996767/5671433.
A SparkSession needs an initialized SparkContext behind it.
One way to get both is to write a function that initializes all your contexts and the Spark session.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession
def init_spark(app_name, master_config):
    """
    :param app_name: Name of the app
    :param master_config: e.g. local[4]
    :returns: SparkContext, SQLContext, SparkSession
    """
    conf = SparkConf().setAppName(app_name).setMaster(master_config)
    sc = SparkContext(conf=conf)
    sc.setLogLevel("ERROR")
    sql_ctx = SQLContext(sc)
    spark = SparkSession(sc)
    return (sc, sql_ctx, spark)
This can then be called as
sc, sql_ctx, spark = init_spark("App_name", "local[4]")

Related

getting error while trying to read athena table in spark

I have the following code snippet in pyspark:
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row, SQLContext, SparkSession
import pyspark.sql.dataframe
def validate_data():
    conf = SparkConf().setAppName("app")
    spark = SparkContext(conf=conf)
    config = {
        "val_path" : "s3://forecasting/data/validation.csv"
    }
    data1_df = spark.read.table("db1.data_dest")
    data2_df = spark.read.table("db2.data_source")
    print(data1_df.count())
    print(data2_df.count())
if __name__ == "__main__":
    validate_data()
This code works fine when run in a Jupyter notebook on SageMaker (connecting to EMR),
but when we run it as a Python script from the terminal, it throws this error:
AttributeError: 'SparkContext' object has no attribute 'read'
We have to automate these notebooks, so we are trying to convert them to Python scripts.
You can only call read on a Spark Session, not on a Spark Context.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("app")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
Or you can convert the Spark context to a Spark session:
conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

PySpark Cassandra Database Connection Problem

I am trying to use Cassandra with PySpark. I can connect to the remote Spark server properly, but I run into trouble when reading a Cassandra table. I tried all of the DataStax connectors and changed the Spark configs (cores, memory, etc.), but I couldn't get it to work. (The commented-out lines in the code below are my attempts.)
Here is my Python code:
import os
os.environ['JAVA_HOME']="C:\Program Files\Java\jdk1.8.0_271"
os.environ['HADOOP_HOME']="E:\etc\spark-3.0.1-bin-hadoop2.7"
os.environ['PYSPARK_DRIVER_PYTHON']="/usr/local/bin/python3.7"
os.environ['PYSPARK_PYTHON']="/usr/local/bin/python3.7"
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=XX.XX.XX.XX spark.cassandra.auth.username=username spark.cassandra.auth.password=passwd pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars .ivy2\jars\spark-cassandra-connector-driver_2.12-3.0.0-alpha2.jar pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 pyspark-shell'
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setMaster("spark://YY.YY.YY:7077").setAppName("My app")
conf.set("spark.shuffle.service.enabled", "false")
conf.set("spark.dynamicAllocation.enabled","false")
conf.set("spark.executor.cores", "2")
conf.set("spark.executor.memory", "5g")
conf.set("spark.executor.instances", "1")
conf.set("spark.jars", "C:\\Users\\verianalizi\\.ivy2\\jars\\spark-cassandra-connector_2.12-3.0.0-beta.jar")
conf.set("spark.cassandra.connection.host","XX.XX.XX.XX")
conf.set("spark.cassandra.auth.username","username")
conf.set("spark.cassandra.auth.password","passwd")
conf.set("spark.cassandra.connection.port", "9042")
# conf.set("spark.sql.catalog.myCatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
sc = SparkContext(conf=conf)
# sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
list_p = [('John',19),('Smith',29),('Adam',35),('Henry',50)]
rdd = sc.parallelize(list_p)
ppl = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
DF_ppl = sqlContext.createDataFrame(ppl)
# It works well until now
def load_and_get_table_df(keys_space_name, table_name):
    table_df = sqlContext.read\
        .format("org.apache.spark.sql.cassandra")\
        .option("keyspace",keys_space_name)\
        .option("table",table_name)\
        .load()
    return table_df
movies = load_and_get_table_df("weather", "currentweatherconditions")
The error I get is;
Does anyone have any idea what is wrong?
This happens because you're specifying only the spark.jars property and pointing it at a single jar. But the Spark Cassandra Connector depends on a number of additional jars that aren't included in that list. I recommend instead either using spark.jars.packages with the coordinate com.datastax.spark:spark-cassandra-connector_2.12:3.0.0, or specifying in spark.jars the path to an assembly jar that has all the necessary dependencies.
By the way, 3.0 was released several months ago. Why are you still using the beta?
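As a rough sketch of the spark.jars.packages suggestion, the connection settings from the question could be collected as below. The helper names cassandra_settings and build_conf are made up for illustration, and the host is a placeholder; only the connector coordinate comes from the answer itself.

```python
# Coordinate recommended in the answer; Spark resolves it together with its
# transitive dependencies, unlike a single jar listed in spark.jars.
CONNECTOR_COORDINATE = "com.datastax.spark:spark-cassandra-connector_2.12:3.0.0"


def cassandra_settings(host, port="9042"):
    """Return the Spark properties as plain key/value pairs (hypothetical helper)."""
    return {
        "spark.jars.packages": CONNECTOR_COORDINATE,
        "spark.cassandra.connection.host": host,
        "spark.cassandra.connection.port": port,
    }


def build_conf(host):
    """Build a SparkConf from the settings above (hypothetical helper)."""
    # Imported here so the settings can be inspected without PySpark installed.
    from pyspark import SparkConf

    conf = SparkConf()
    # setAll takes a list of (key, value) pairs.
    conf.setAll(list(cassandra_settings(host).items()))
    return conf
```

Note that spark.jars is deliberately absent from the settings: the packages coordinate replaces the single-jar path.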

Where to write the setup and teardown code for Locust tests?

I've been exploring Locust for our load-testing requirements for Spark, but I'm stuck on some very basic tasks; the documentation also seems very limited.
I'm stuck on how and where to write setup and teardown code that needs to run only once, regardless of the number of users. I tried the sample given in the docs, shown below, but the code under events.test_start doesn't seem to run, as I'm unable to use the attribute 'sc' anywhere in the SparkJob class. Any idea how to access the Spark instances created in the on_test_start method from my SparkJob class?
from locust import User, TaskSet, task, between
from locust import events
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
class SparkJob(TaskSet):
    @task
    def submit_jobs(self):
        # sample spark job
        pass
class SparkUser(User):
    host = xxx
    wait_time = xxx
    tasks = [SparkJob]
@events.test_start.add_listener
def on_test_start(**kw):
    conf = SparkConf().setAppName(conn_st['app'])
    sc = SparkContext(master=conn_st['master'], conf=conf)
    #spark = SparkSession(sc)
    return sc
@events.test_stop.add_listener
def on_test_stop(**kw):
    #spark.stop()
    sc.stop()
I don't know anything about Spark, but making sc or spark a global variable should work for you. So something like:
from locust import User, TaskSet, task, between
from locust import events
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
spark: SparkSession = None
class SparkJob(TaskSet):
    @task
    def submit_jobs(self):
        # sample spark job
        spark.do_stuff()
class SparkUser(User):
    host = xxx
    wait_time = xxx
    tasks = [SparkJob]
@events.test_start.add_listener
def on_test_start(**kw):
    global spark
    conf = SparkConf().setAppName(conn_st['app'])
    sc = SparkContext(master=conn_st['master'], conf=conf)
    spark = SparkSession(sc)
@events.test_stop.add_listener
def on_test_stop(**kw):
    spark.stop()
You can read more about Python global variables. In short, you only need the global statement in a function that assigns to the variable; a function that only reads it finds the module-level name automatically. You can be explicit and add global in each place, though.
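The rule about global can be seen in a small example that has nothing to do with Spark or Locust. The names counter, read_counter, shadow_counter and increment_with_global are invented for this illustration.

```python
counter = 0  # module-level (global) variable


def read_counter():
    # Only reads the name, so Python finds the module-level variable on its own.
    return counter


def shadow_counter():
    # Assigning without `global` creates a new local variable;
    # the module-level `counter` is left untouched.
    counter = 100
    return counter


def increment_with_global():
    # `global` is required here because the function rebinds the module-level name.
    global counter
    counter += 1
    return counter
```

This is why on_test_start needs global spark (it assigns the session), while submit_jobs and on_test_stop can use spark without it.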

PySpark + jupyter notebook

I am trying to configure a Spark context in my notebook, but something is wrong. I do:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
if sc==sc:
    sc.stop()
if spark==spark:
    spark.stop()
conf = SparkConf()
conf = conf.setAppName(appName)
conf = conf.set("spark.master", master)
conf = conf.set("spark.python.worker.memory", "1042M")
spark.stop()
session_builder = SparkSession.builder
session_builder = session_builder.master(master)
spark = session_builder.getOrCreate()
and this gives me an error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Can we change the Spark configuration in a Jupyter notebook?
And how?
I am on the latest version of Spark with a standalone cluster.
Following the proposed action I did:
which seems to mean the SparkContext has been recreated, but the SparkSession is not linked to the new sc anymore.
Just use the config option when building the SparkSession (as of 2.4):
from pyspark.sql import SparkSession
MAX_MEMORY = "5g"
spark = SparkSession \
    .builder \
    .appName("Foo") \
    .config("spark.executor.memory", MAX_MEMORY) \
    .config("spark.driver.memory", MAX_MEMORY) \
    .getOrCreate()
From the code above, I understand that sc is your SparkContext and spark is your SparkSession. You stop both of them and then call spark.stop() again on an already terminated session. Instead, use this:
from pyspark import SparkConf, SparkContext
sc.stop()
conf = (SparkConf()
        .setMaster("local")
        .setAppName("App_name")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)
You can find the documentation here: Pyspark
If you have configured your notebook with PySpark, you don't need to stop the Spark context and create a new one. Instead, you can use sc as your Spark context. You can pass additional configuration via spark-submit as command-line arguments. You can refer to the configuration documentation here: Pyspark Configuration
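If the configuration does have to change inside a running notebook, one possible pattern is to stop the old session exactly once and then take the context from the new session, so the two stay linked. This is only a sketch: restart_session is a hypothetical helper, written against anything that has stop() and getOrCreate() methods rather than against Spark directly.

```python
def restart_session(existing, builder):
    """Stop `existing` (if any) and return a fresh session from `builder`.

    `existing` is anything with a .stop() method, or None.
    `builder` is anything with a .getOrCreate() method, such as a
    configured SparkSession.builder.
    """
    if existing is not None:
        # Stopping a SparkSession also stops its underlying SparkContext.
        existing.stop()
    return builder.getOrCreate()


# Possible notebook usage (assumes PySpark is installed):
#   from pyspark.sql import SparkSession
#   builder = SparkSession.builder.appName("Foo").config("spark.executor.memory", "5g")
#   spark = restart_session(globals().get("spark"), builder)
#   sc = spark.sparkContext  # take the context from the new session
```

Using globals().get("spark") avoids a NameError when no session exists yet, which a check like spark==spark would raise.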

MySQL read with PySpark

I have the following test code:
from pyspark import SparkContext, SQLContext
sc = SparkContext('local')
sqlContext = SQLContext(sc)
print('Created spark context!')
if __name__ == '__main__':
    df = sqlContext.read.format("jdbc").options(
        url="jdbc:mysql://localhost/mysql",
        driver="com.mysql.jdbc.Driver",
        dbtable="users",
        user="user",
        password="****",
        properties={"driver": 'com.mysql.jdbc.Driver'}
    ).load()
    print(df)
When I run it, I get the following error:
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
In Scala, this is solved by importing the mysql-connector-java .jar into the project.
However, in Python I have no idea how to tell the pyspark module to link the mysql-connector file.
I have seen this solved with examples like
spark --package=mysql-connector-java testfile.py
But I don't want this, since it forces me to run my script in an unusual way. I would like an all-Python solution, or to copy a file somewhere, or to add something to the path.
You can pass arguments to spark-submit by setting PYSPARK_SUBMIT_ARGS before the SparkConf and SparkContext are created:
import os
from pyspark import SparkConf, SparkContext
SUBMIT_ARGS = "--packages mysql:mysql-connector-java:5.1.39 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
conf = SparkConf()
sc = SparkContext(conf=conf)
Or you can add them to your $SPARK_HOME/conf/spark-defaults.conf.
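For the PYSPARK_SUBMIT_ARGS approach, a small helper can keep the string well formed when more than one package is needed. This is a sketch; build_submit_args and set_submit_args are hypothetical names, not part of PySpark.

```python
import os


def build_submit_args(packages):
    """Join Maven coordinates into a PYSPARK_SUBMIT_ARGS value (hypothetical helper)."""
    # --packages takes a comma-separated list of coordinates; the value
    # ends with the "pyspark-shell" token, as in the answer above.
    return "--packages {} pyspark-shell".format(",".join(packages))


def set_submit_args(packages, environ=os.environ):
    """Store the value in the environment (hypothetical helper)."""
    # Call this before the first SparkContext or SparkSession is created:
    # PySpark reads the variable when it launches the JVM.
    environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(packages)
    return environ["PYSPARK_SUBMIT_ARGS"]
```

Calling set_submit_args(["mysql:mysql-connector-java:5.1.39"]) at the top of the script reproduces the value used in the answer.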
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName("Word Count")\
    .config("spark.driver.extraClassPath", "/home/tuhin/mysql.jar")\
    .getOrCreate()
dataframe_mysql = spark.read\
    .format("jdbc")\
    .option("url", "jdbc:mysql://localhost/database_name")\
    .option("driver", "com.mysql.jdbc.Driver")\
    .option("dbtable", "employees").option("user", "root")\
    .option("password", "12345678").load()
print(dataframe_mysql.columns)
"/home/tuhin/mysql.jar" is the location of mysql jar file
If you are using PyCharm and want to run line by line instead of submitting your .py file through spark-submit, you can copy your .jar to c:\spark\jars\ and use code like this:
from pyspark.sql import SparkSession
# A SparkSession exposes .read directly, so no separate SQLContext is needed.
spark = SparkSession.builder.getOrCreate()
source_df = spark.read.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/database1',
    driver='com.mysql.cj.jdbc.Driver',  # Connector/J 8.x; use com.mysql.jdbc.Driver for 5.x
    dbtable='table1',
    user='root',
    password='****').load()
print(source_df)
source_df.show()
