PySpark + jupyter notebook - apache-spark

I am trynig to configure a spark context into my notebook, but there is something wrong, I do :
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
if sc==sc:
sc.stop()
if spark==spark:
spark.stop()
conf = SparkConf()
conf = conf.setAppName(appName)
conf = conf.set("spark.master", master)
conf = conf.set("spark.python.worker.memory", "1042M")
spark.stop()
session_builder = SparkSession.builder
session_builder = session_builder.master(master)
spark = session_builder.getOrCreate()
and this give me an error :
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Can we change the configuration of spark in a jupyter notebook ?
And how ?
I am on the last version of spark with a standalone cluster.
Following the propose action I did :
which seems to mean the spark Context has been recreated, but the sparSession is not linked to the new sc anymore.

Just use the config option when setting SparkSession (as of 2.4)
MAX_MEMORY = "5g"
spark = SparkSession \
.builder \
.appName("Foo") \
.config("spark.executor.memory", MAX_MEMORY) \
.config("spark.driver.memory", MAX_MEMORY) \
.getOrCreate()

From the code above, what I understand is sc is you sparkcontext and spark is your sparkSession variable. You are stopping both of them and then using spark.stop() again on an already terminated session. Instead use this:
from pyspark import SparkConf, SparkContext
sc.stop()
conf = (SparkConf()
.setMaster("local")
.setAppName("App_name")
.set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)
You can find the documentation here: Pyspark
If you have configured your notebook with pyspark, you don't need to stop a spark context and create a new one. Instead you can you sc as you spark context. You can pass additional configurations via spark-submit as command line arguments. You can refer the configuration documentation here:Pyspark Configuration

Related

Errors Loading csv file with spark-submit

I am new to py spark and I have been running jobs on Jupiter notebook which is running smoothly but having issues running spark-submit for loading a CSV file.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
if __name__ == '__main__':
conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)
load csv file
netflix_df = spark.read.format("csv") \
.option("header", "true") \
.option("inferSchema","true") \
.load("netflix_titles.csv")
The above code works perfectly on Jupiter notebook but doesn't work when trying to run the same code saved in a python file with spark-submit
I get the following errors
NameError: name 'spark' is not defined
when i replace spark.read.format("csv") with sc.read.format("csv")
I get the following error
AttributeError: 'SparkContext' object has no attribute 'read'
You need to create a spark session.
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder()
.master("local[1]") # replace with suitable parameter
.appName("demo")
.getOrCreate()
#now you use spark.read

PySpark Cassandra Databese Connection Problem

I am trying to use cassandra with pyspark. I can make a remote connection to Spark Server properly. But the stage of read cassandra table, I am in trouble. I tried all of datastax connectors, i changed Spark configs(core, memory, etc) but I couldnt accomplish it. (The comment rows in below code are my tries.)
Here is my python codes;
import os
os.environ['JAVA_HOME']="C:\Program Files\Java\jdk1.8.0_271"
os.environ['HADOOP_HOME']="E:\etc\spark-3.0.1-bin-hadoop2.7"
os.environ['PYSPARK_DRIVER_PYTHON']="/usr/local/bin/python3.7"
os.environ['PYSPARK_PYTHON']="/usr/local/bin/python3.7"
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=XX.XX.XX.XX spark.cassandra.auth.username=username spark.cassandra.auth.password=passwd pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars .ivy2\jars\spark-cassandra-connector-driver_2.12-3.0.0-alpha2.jar pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 pyspark-shell'
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setMaster("spark://YY.YY.YY:7077").setAppName("My app")
conf.set("spark.shuffle.service.enabled", "false")
conf.set("spark.dynamicAllocation.enabled","false")
conf.set("spark.executor.cores", "2")
conf.set("spark.executor.memory", "5g")
conf.set("spark.executor.instances", "1")
conf.set("spark.jars", "C:\\Users\\verianalizi\\.ivy2\\jars\\spark-cassandra-connector_2.12-3.0.0-beta.jar")
conf.set("spark.cassandra.connection.host","XX.XX.XX.XX")
conf.set("spark.cassandra.auth.username","username")
conf.set("spark.cassandra.auth.password","passwd")
conf.set("spark.cassandra.connection.port", "9042")
# conf.set("spark.sql.catalog.myCatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
sc = SparkContext(conf=conf)
# sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
list_p = [('John',19),('Smith',29),('Adam',35),('Henry',50)]
rdd = sc.parallelize(list_p)
ppl = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
DF_ppl = sqlContext.createDataFrame(ppl)
# It works well until now
def load_and_get_table_df(keys_space_name, table_name):
table_df = sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.option("keyspace",keys_space_name)\
.option("table",table_name)\
.load()
return table_df
movies = load_and_get_table_df("weather", "currentweatherconditions")
The error I get is;
Someone have any idea with that?
This happens because you're specifying only spark.jars property, and pointing to the single jar. But spark cassandra connector depends on the number of the additional jars that aren't included into that list. I recommend instead either use spark.jars.packages with coordinate com.datastax.spark:spark-cassandra-connector_2.12:3.0.0, or specify in spark.jars the path to the assembly jar that has all necessary dependencies.
btw, 3.0 was release several months ago - why are you still using beta?

NameError: name 'SparkSession' is not defined

I'm new to cask cdap and Hadoop environment.
I'm creating a pipeline and I want to use a PySpark Program. I have all the script of the spark program and it works when I test it by command like, insted it doesn't if I try to copy- paste it in a cdap pipeline.
It gives me an error in the logs:
NameError: name 'SparkSession' is not defined
My script starts in this way:
from pyspark.sql import *
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import trim, to_date, year, month
sc= SparkContext()
How can I fix it?
Spark connects with the local running spark cluster through SparkContext. A better explanation can be found here https://stackoverflow.com/a/24996767/5671433.
To initialise a SparkSession, a SparkContext has to be initialized.
One way to do that is to write a function that initializes all your contexts and a spark session.
def init_spark(app_name, master_config):
"""
:params app_name: Name of the app
:params master_config: eg. local[4]
:returns SparkContext, SQLContext, SparkSession:
"""
conf = (SparkConf().setAppName(app_name).setMaster(master_config))
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sql_ctx = SQLContext(sc)
spark = SparkSession(sc)
return (sc, sql_ctx, spark)
This can then be called as
sc, sql_ctx, spark = init_spark("App_name", "local[4]")

sc is not defined in SparkContext

My Spark package is spark-2.2.0-bin-hadoop2.7.
I exported spark variables as
export SPARK_HOME=/home/harry/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
I opened spark notebook by
pyspark
I am able to load packages from spark
from pyspark import SparkContext, SQLContext
from pyspark.ml.regression import LinearRegression
print(SQLContext)
output is
<class 'pyspark.sql.context.SQLContext'>
But my error is
print(sc)
"sc is undefined"
plz can anyone help me out ...!
In pysparkShell, SparkContext is already initialized as SparkContext(app=PySparkShell, master=local[*]) so you just need to use getOrCreate() to set the SparkContext to a variable as
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
For coding purpose in simple local mode, you can do the following
from pyspark import SparkConf, SparkContext, SQLContext
conf = SparkConf().setAppName("test").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
print(sc)
print(sqlContext)

MySQL read with PySpark

I have the following test code:
from pyspark import SparkContext, SQLContext
sc = SparkContext('local')
sqlContext = SQLContext(sc)
print('Created spark context!')
if __name__ == '__main__':
df = sqlContext.read.format("jdbc").options(
url="jdbc:mysql://localhost/mysql",
driver="com.mysql.jdbc.Driver",
dbtable="users",
user="user",
password="****",
properties={"driver": 'com.mysql.jdbc.Driver'}
).load()
print(df)
When I run it, I get the following error:
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
In Scala, this is solved by importing the .jar mysql-connector-java into the project.
However, in python I have no idea how to tell the pyspark module to link the mysql-connector file.
I have seen this solved with examples like
spark --package=mysql-connector-java testfile.py
But I don't want this since it forces me to run my script in a weird way. I would like an all python solution or copy a file somewhere or, add something to the Path.
You can pass arguments to spark-submit when creating your sparkContext before SparkConf is initialized:
import os
from pyspark import SparkConf, SparkContext
SUBMIT_ARGS = "--packages mysql:mysql-connector-java:5.1.39 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
conf = SparkConf()
sc = SparkContext(conf=conf)
or you can add them to your $SPARK_HOME/conf/spark-defaults.conf
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("Word Count")\
.config("spark.driver.extraClassPath", "/home/tuhin/mysql.jar")\
.getOrCreate()
dataframe_mysql = spark.read\
.format("jdbc")\
.option("url", "jdbc:mysql://localhost/database_name")\
.option("driver", "com.mysql.jdbc.Driver")\
.option("dbtable", "employees").option("user", "root")\
.option("password", "12345678").load()
print(dataframe_mysql.columns)
"/home/tuhin/mysql.jar" is the location of mysql jar file
If you are using pycharm and want to run line by line instead of submitting your .py through spark-submit, you can copy your .jar to c:\spark\jars\ and your code could be like:
from pyspark import SparkConf, SparkContext, sql
from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
sqlContext = sql.SQLContext(sc)
source_df = sqlContext.read.format('jdbc').options(
url='jdbc:mysql://localhost:3306/database1',
driver='com.mysql.cj.jdbc.Driver', #com.mysql.jdbc.Driver
dbtable='table1',
user='root',
password='****').load()
print (source_df)
source_df.show()

Resources