MySQL read with PySpark - python-3.x

I have the following test code:
from pyspark import SparkContext, SQLContext
sc = SparkContext('local')
sqlContext = SQLContext(sc)
print('Created spark context!')
if __name__ == '__main__':
df = sqlContext.read.format("jdbc").options(
url="jdbc:mysql://localhost/mysql",
driver="com.mysql.jdbc.Driver",
dbtable="users",
user="user",
password="****",
properties={"driver": 'com.mysql.jdbc.Driver'}
).load()
print(df)
When I run it, I get the following error:
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
In Scala, this is solved by importing the .jar mysql-connector-java into the project.
However, in python I have no idea how to tell the pyspark module to link the mysql-connector file.
I have seen this solved with examples like
spark --package=mysql-connector-java testfile.py
But I don't want this since it forces me to run my script in a weird way. I would like an all python solution or copy a file somewhere or, add something to the Path.

You can pass arguments to spark-submit when creating your sparkContext before SparkConf is initialized:
import os
from pyspark import SparkConf, SparkContext
SUBMIT_ARGS = "--packages mysql:mysql-connector-java:5.1.39 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
conf = SparkConf()
sc = SparkContext(conf=conf)
or you can add them to your $SPARK_HOME/conf/spark-defaults.conf

from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("Word Count")\
.config("spark.driver.extraClassPath", "/home/tuhin/mysql.jar")\
.getOrCreate()
dataframe_mysql = spark.read\
.format("jdbc")\
.option("url", "jdbc:mysql://localhost/database_name")\
.option("driver", "com.mysql.jdbc.Driver")\
.option("dbtable", "employees").option("user", "root")\
.option("password", "12345678").load()
print(dataframe_mysql.columns)
"/home/tuhin/mysql.jar" is the location of mysql jar file

If you are using pycharm and want to run line by line instead of submitting your .py through spark-submit, you can copy your .jar to c:\spark\jars\ and your code could be like:
from pyspark import SparkConf, SparkContext, sql
from pyspark.sql import SparkSession
sc = SparkSession.builder.getOrCreate()
sqlContext = sql.SQLContext(sc)
source_df = sqlContext.read.format('jdbc').options(
url='jdbc:mysql://localhost:3306/database1',
driver='com.mysql.cj.jdbc.Driver', #com.mysql.jdbc.Driver
dbtable='table1',
user='root',
password='****').load()
print (source_df)
source_df.show()

Related

Errors Loading csv file with spark-submit

I am new to py spark and I have been running jobs on Jupiter notebook which is running smoothly but having issues running spark-submit for loading a CSV file.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
if __name__ == '__main__':
conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)
load csv file
netflix_df = spark.read.format("csv") \
.option("header", "true") \
.option("inferSchema","true") \
.load("netflix_titles.csv")
The above code works perfectly on Jupiter notebook but doesn't work when trying to run the same code saved in a python file with spark-submit
I get the following errors
NameError: name 'spark' is not defined
when i replace spark.read.format("csv") with sc.read.format("csv")
I get the following error
AttributeError: 'SparkContext' object has no attribute 'read'
You need to create a spark session.
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder()
.master("local[1]") # replace with suitable parameter
.appName("demo")
.getOrCreate()
#now you use spark.read

getting error while trying to read athena table in spark

I have the following code snippet in pyspark:
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row, SQLContext, SparkSession
import pyspark.sql.dataframe
def validate_data():
conf = SparkConf().setAppName("app")
spark = SparkContext(conf=conf)
config = {
"val_path" : "s3://forecasting/data/validation.csv"
}
data1_df = spark.read.table("db1.data_dest”)
data2_df = spark.read.table("db2.data_source”)
print(data1_df.count())
print(data2_df.count())
if __name__ == "__main__":
validate_data()
Now this code works fine when run on jupyter notebook on sagemaker ( connecting to EMR )
but when we are running as a python script on terminal, its throwing this error
Error message
AttributeError: 'SparkContext' object has no attribute 'read'
We have to automate these notebooks, so we are trying to convert them to python scripts
You can only call read on a Spark Session, not on a Spark Context.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("app")
spark = SparkSession.builder.config(conf=conf)
Or you can convert the Spark context to a Spark session
conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

PySpark Cassandra Databese Connection Problem

I am trying to use cassandra with pyspark. I can make a remote connection to Spark Server properly. But the stage of read cassandra table, I am in trouble. I tried all of datastax connectors, i changed Spark configs(core, memory, etc) but I couldnt accomplish it. (The comment rows in below code are my tries.)
Here is my python codes;
import os
os.environ['JAVA_HOME']="C:\Program Files\Java\jdk1.8.0_271"
os.environ['HADOOP_HOME']="E:\etc\spark-3.0.1-bin-hadoop2.7"
os.environ['PYSPARK_DRIVER_PYTHON']="/usr/local/bin/python3.7"
os.environ['PYSPARK_PYTHON']="/usr/local/bin/python3.7"
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=XX.XX.XX.XX spark.cassandra.auth.username=username spark.cassandra.auth.password=passwd pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars .ivy2\jars\spark-cassandra-connector-driver_2.12-3.0.0-alpha2.jar pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0-alpha2 pyspark-shell'
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setMaster("spark://YY.YY.YY:7077").setAppName("My app")
conf.set("spark.shuffle.service.enabled", "false")
conf.set("spark.dynamicAllocation.enabled","false")
conf.set("spark.executor.cores", "2")
conf.set("spark.executor.memory", "5g")
conf.set("spark.executor.instances", "1")
conf.set("spark.jars", "C:\\Users\\verianalizi\\.ivy2\\jars\\spark-cassandra-connector_2.12-3.0.0-beta.jar")
conf.set("spark.cassandra.connection.host","XX.XX.XX.XX")
conf.set("spark.cassandra.auth.username","username")
conf.set("spark.cassandra.auth.password","passwd")
conf.set("spark.cassandra.connection.port", "9042")
# conf.set("spark.sql.catalog.myCatalog", "com.datastax.spark.connector.datasource.CassandraCatalog")
sc = SparkContext(conf=conf)
# sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)
list_p = [('John',19),('Smith',29),('Adam',35),('Henry',50)]
rdd = sc.parallelize(list_p)
ppl = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
DF_ppl = sqlContext.createDataFrame(ppl)
# It works well until now
def load_and_get_table_df(keys_space_name, table_name):
table_df = sqlContext.read\
.format("org.apache.spark.sql.cassandra")\
.option("keyspace",keys_space_name)\
.option("table",table_name)\
.load()
return table_df
movies = load_and_get_table_df("weather", "currentweatherconditions")
The error I get is;
Someone have any idea with that?
This happens because you're specifying only spark.jars property, and pointing to the single jar. But spark cassandra connector depends on the number of the additional jars that aren't included into that list. I recommend instead either use spark.jars.packages with coordinate com.datastax.spark:spark-cassandra-connector_2.12:3.0.0, or specify in spark.jars the path to the assembly jar that has all necessary dependencies.
btw, 3.0 was release several months ago - why are you still using beta?

sc is not defined in SparkContext

My Spark package is spark-2.2.0-bin-hadoop2.7.
I exported spark variables as
export SPARK_HOME=/home/harry/spark-2.2.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
I opened spark notebook by
pyspark
I am able to load packages from spark
from pyspark import SparkContext, SQLContext
from pyspark.ml.regression import LinearRegression
print(SQLContext)
output is
<class 'pyspark.sql.context.SQLContext'>
But my error is
print(sc)
"sc is undefined"
plz can anyone help me out ...!
In pysparkShell, SparkContext is already initialized as SparkContext(app=PySparkShell, master=local[*]) so you just need to use getOrCreate() to set the SparkContext to a variable as
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
For coding purpose in simple local mode, you can do the following
from pyspark import SparkConf, SparkContext, SQLContext
conf = SparkConf().setAppName("test").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
print(sc)
print(sqlContext)

PySpark + jupyter notebook

I am trynig to configure a spark context into my notebook, but there is something wrong, I do :
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
if sc==sc:
sc.stop()
if spark==spark:
spark.stop()
conf = SparkConf()
conf = conf.setAppName(appName)
conf = conf.set("spark.master", master)
conf = conf.set("spark.python.worker.memory", "1042M")
spark.stop()
session_builder = SparkSession.builder
session_builder = session_builder.master(master)
spark = session_builder.getOrCreate()
and this give me an error :
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Can we change the configuration of spark in a jupyter notebook ?
And how ?
I am on the last version of spark with a standalone cluster.
Following the propose action I did :
which seems to mean the spark Context has been recreated, but the sparSession is not linked to the new sc anymore.
Just use the config option when setting SparkSession (as of 2.4)
MAX_MEMORY = "5g"
spark = SparkSession \
.builder \
.appName("Foo") \
.config("spark.executor.memory", MAX_MEMORY) \
.config("spark.driver.memory", MAX_MEMORY) \
.getOrCreate()
From the code above, what I understand is sc is you sparkcontext and spark is your sparkSession variable. You are stopping both of them and then using spark.stop() again on an already terminated session. Instead use this:
from pyspark import SparkConf, SparkContext
sc.stop()
conf = (SparkConf()
.setMaster("local")
.setAppName("App_name")
.set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)
You can find the documentation here: Pyspark
If you have configured your notebook with pyspark, you don't need to stop a spark context and create a new one. Instead you can you sc as you spark context. You can pass additional configurations via spark-submit as command line arguments. You can refer the configuration documentation here:Pyspark Configuration

Resources