Errors Loading csv file with spark-submit - apache-spark

I am new to PySpark. My jobs run smoothly in a Jupyter notebook, but I am having trouble loading a CSV file when I run the same code with spark-submit.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
if __name__ == '__main__':
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)
    # load csv file
    netflix_df = spark.read.format("csv") \
        .option("header", "true") \
        .option("inferSchema","true") \
        .load("netflix_titles.csv")
The above code works perfectly in a Jupyter notebook but fails when the same code is saved in a Python file and run with spark-submit.
I get the following error:
NameError: name 'spark' is not defined
When I replace spark.read.format("csv") with sc.read.format("csv"), I get the following error:
AttributeError: 'SparkContext' object has no attribute 'read'

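You need to create a Spark session. A notebook started through PySpark usually defines the spark variable for you; a script run with spark-submit has to create it itself.
import pyspark
from pyspark.sql import SparkSession
# builder is an attribute, not a method, so it is not called with ()
spark = (SparkSession.builder
    .master("local[1]")  # replace with a suitable master
    .appName("demo")
    .getOrCreate())
# now you can use spark.read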
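Put together, the script passed to spark-submit could look roughly like the sketch below. The helper names read_csv and main are made up for illustration; the file name and reader options are the ones from the question. The session is passed in as an argument, so nothing depends on a notebook-provided spark variable.

```python
def read_csv(spark, path):
    """Read a CSV file with a header row, letting Spark infer column types."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(path)
    )


def main():
    # Imported here so that read_csv can be exercised without a Spark install.
    from pyspark.sql import SparkSession

    # Unlike a notebook, a plain script must create the session itself.
    spark = SparkSession.builder.appName("app").getOrCreate()
    try:
        netflix_df = read_csv(spark, "netflix_titles.csv")
        netflix_df.printSchema()
    finally:
        spark.stop()
```

Calling main() under the usual if __name__ == '__main__': guard and submitting the file with spark-submit should then avoid the NameError, assuming netflix_titles.csv is reachable from the working directory.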

Related

getting error while trying to read athena table in spark

I have the following code snippet in pyspark:
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row, SQLContext, SparkSession
import pyspark.sql.dataframe
def validate_data():
    conf = SparkConf().setAppName("app")
    spark = SparkContext(conf=conf)
    config = {
        "val_path" : "s3://forecasting/data/validation.csv"
    }
    data1_df = spark.read.table("db1.data_dest")
    data2_df = spark.read.table("db2.data_source")
    print(data1_df.count())
    print(data2_df.count())
if __name__ == "__main__":
    validate_data()
This code works fine when run in a Jupyter notebook on SageMaker (connected to EMR), but when we run it as a Python script from the terminal, it throws this error:
AttributeError: 'SparkContext' object has no attribute 'read'
We have to automate these notebooks, so we are trying to convert them to Python scripts.
You can only call read on a SparkSession, not on a SparkContext.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("app")
# getOrCreate() is what actually returns the session; without it you only have a builder
spark = SparkSession.builder.config(conf=conf).getOrCreate()
Or you can convert the Spark context to a Spark session
conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
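With a session available, the validation step from the question could be written as a function that receives the session rather than building a context itself. This is only a sketch: count_tables is a hypothetical helper, and the table names are the ones from the question.

```python
def count_tables(spark, table_names):
    """Return {table name: row count} for each table, read through the session."""
    # spark must be a SparkSession; a SparkContext has no .read attribute.
    return {name: spark.read.table(name).count() for name in table_names}


def validate_data():
    from pyspark.sql import SparkSession  # needs a Spark installation

    spark = SparkSession.builder.appName("app").getOrCreate()
    counts = count_tables(spark, ["db1.data_dest", "db2.data_source"])
    for name, rows in counts.items():
        print(name, rows)
```

Keeping the session as a parameter also means the same function can be called from a notebook with the existing spark variable and from a script with a freshly built one.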

Reading Excel (.xlsx) file in pyspark

I am trying to read a .xlsx file from local path in PySpark.
I've written the below code:
from pyspark.shell import sqlContext
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('local') \
    .appName('Planning') \
    .enableHiveSupport() \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()
df = sqlContext.read("C:\P_DATA\tyco_93_A.xlsx").show()
Error:
TypeError: 'DataFrameReader' object is not callable
You can use pandas to read the .xlsx file and then convert the result to a Spark DataFrame.
from pyspark.sql import SparkSession
import pandas
spark = SparkSession.builder.appName("Test").getOrCreate()
# pandas infers column types itself; read_excel has no inferSchema argument
pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname')
df = spark.createDataFrame(pdf)
df.show()
You could use the crealytics spark-excel package.
It has to be added to Spark, either through its Maven coordinates or when starting the Spark shell, as below.
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-excel_2.12:0.13.1
Databricks users need to add it as a library by navigating to
Cluster - 'clusterName' - Libraries - Install New and providing 'com.crealytics:spark-excel_2.12:0.13.1' under Maven coordinates.
# parentheses allow the chained calls to span lines in Python;
# the raw string keeps the backslashes in the Windows path literal
df = (spark.read
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'Sheet1'!")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(r"C:\P_DATA\tyco_93_A.xlsx"))
More options are listed on the GitHub page below.
https://github.com/crealytics/spark-excel

NameError: name 'SparkSession' is not defined

I'm new to Cask CDAP and the Hadoop environment.
I'm creating a pipeline and I want to use a PySpark program. I have the whole Spark script and it works when I test it from the command line, but it doesn't when I copy and paste it into a CDAP pipeline.
It gives me an error in the logs:
NameError: name 'SparkSession' is not defined
My script starts in this way:
from pyspark.sql import *
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import trim, to_date, year, month
sc= SparkContext()
How can I fix it?
Spark connects with the local running spark cluster through SparkContext. A better explanation can be found here https://stackoverflow.com/a/24996767/5671433.
To initialise a SparkSession, a SparkContext has to be initialized.
One way to do that is to write a function that initializes all your contexts and a spark session.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, SparkSession

def init_spark(app_name, master_config):
    """
    :param app_name: Name of the app
    :param master_config: e.g. local[4]
    :returns: SparkContext, SQLContext, SparkSession
    """
    conf = SparkConf().setAppName(app_name).setMaster(master_config)
    sc = SparkContext(conf=conf)
    sc.setLogLevel("ERROR")
    sql_ctx = SQLContext(sc)
    spark = SparkSession(sc)
    return (sc, sql_ctx, spark)
This can then be called as
sc, sql_ctx, spark = init_spark("App_name", "local[4]")

PySpark + jupyter notebook

I am trying to configure a Spark context in my notebook, but something is going wrong. I do:
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
if sc==sc:
    sc.stop()
if spark==spark:
    spark.stop()
conf = SparkConf()
conf = conf.setAppName(appName)
conf = conf.set("spark.master", master)
conf = conf.set("spark.python.worker.memory", "1042M")
spark.stop()
session_builder = SparkSession.builder
session_builder = session_builder.master(master)
spark = session_builder.getOrCreate()
and this gives me an error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
Can we change the configuration of Spark in a Jupyter notebook?
And how?
I am on the latest version of Spark with a standalone cluster.
Following the proposed action I did:
which seems to mean the Spark context has been recreated, but the SparkSession is no longer linked to the new sc.
Just use the config option when setting SparkSession (as of 2.4)
MAX_MEMORY = "5g"
spark = SparkSession \
.builder \
.appName("Foo") \
.config("spark.executor.memory", MAX_MEMORY) \
.config("spark.driver.memory", MAX_MEMORY) \
.getOrCreate()
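If several settings need changing, the same builder pattern can be driven from a dictionary. The sketch below is an illustration rather than part of the answer: apply_settings and new_session are made-up helpers and the values are placeholders. Keep in mind that getOrCreate() returns the already running session if there is one, so in a notebook the new settings may not all take effect unless the previous session has been stopped first.

```python
def apply_settings(builder, settings):
    """Call builder.config(key, value) for every entry and return the builder."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder


def new_session(app_name, settings):
    from pyspark.sql import SparkSession  # needs a Spark installation

    builder = SparkSession.builder.appName(app_name)
    return apply_settings(builder, settings).getOrCreate()
```

A call such as new_session("Foo", {"spark.executor.memory": "5g", "spark.driver.memory": "5g"}) would then mirror the snippet above.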
From the code above, I understand that sc is your SparkContext and spark is your SparkSession variable. You stop both of them and then call spark.stop() again on an already terminated session. Instead, use this:
from pyspark import SparkConf, SparkContext
sc.stop()
conf = (SparkConf()
    .setMaster("local")
    .setAppName("App_name")
    .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)
You can find the documentation here: Pyspark
If you have configured your notebook with PySpark, you don't need to stop the Spark context and create a new one. Instead, you can use sc as your Spark context. You can pass additional configuration to spark-submit as command-line arguments. You can refer to the configuration documentation here: Pyspark Configuration

MySQL read with PySpark

I have the following test code:
from pyspark import SparkContext, SQLContext
sc = SparkContext('local')
sqlContext = SQLContext(sc)
print('Created spark context!')
if __name__ == '__main__':
    df = sqlContext.read.format("jdbc").options(
        url="jdbc:mysql://localhost/mysql",
        driver="com.mysql.jdbc.Driver",
        dbtable="users",
        user="user",
        password="****",
        properties={"driver": 'com.mysql.jdbc.Driver'}
    ).load()
    print(df)
When I run it, I get the following error:
java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
In Scala, this is solved by adding the mysql-connector-java .jar to the project.
However, in Python I have no idea how to tell the pyspark module to link the mysql-connector file.
I have seen this solved with examples like
spark --package=mysql-connector-java testfile.py
But I don't want this, since it forces me to run my script in a weird way. I would like an all-Python solution, such as copying a file somewhere or adding something to the path.
You can pass arguments to spark-submit by setting PYSPARK_SUBMIT_ARGS before your SparkContext is created:
import os
from pyspark import SparkConf, SparkContext
SUBMIT_ARGS = "--packages mysql:mysql-connector-java:5.1.39 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
conf = SparkConf()
sc = SparkContext(conf=conf)
or you can add them to your $SPARK_HOME/conf/spark-defaults.conf
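The environment-variable approach can be wrapped in a small helper when more than one package is needed. This is a sketch under assumptions: submit_args is a hypothetical function name, and the Maven coordinate is the one quoted above.

```python
import os


def submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value that pulls in the given Maven packages."""
    # --packages takes a comma-separated list of group:artifact:version coordinates.
    # The trailing "pyspark-shell" token is kept as in the snippet above.
    return "--packages {} pyspark-shell".format(",".join(packages))


# Must be set before the first SparkContext or SparkSession is created,
# because the variable is read when the JVM is launched.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args(
    ["mysql:mysql-connector-java:5.1.39"]
)
```

Since the variable is set from inside the script, the script can still be started with a plain python command, which is what the question asks for.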
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName("Word Count")\
    .config("spark.driver.extraClassPath", "/home/tuhin/mysql.jar")\
    .getOrCreate()
dataframe_mysql = spark.read\
    .format("jdbc")\
    .option("url", "jdbc:mysql://localhost/database_name")\
    .option("driver", "com.mysql.jdbc.Driver")\
    .option("dbtable", "employees")\
    .option("user", "root")\
    .option("password", "12345678")\
    .load()
print(dataframe_mysql.columns)
"/home/tuhin/mysql.jar" is the location of mysql jar file
If you are using PyCharm and want to run line by line instead of submitting your .py file through spark-submit, you can copy your .jar to c:\spark\jars\ and your code could look like this:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# read through the session directly; no separate SQLContext is needed
source_df = spark.read.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/database1',
    driver='com.mysql.cj.jdbc.Driver',  # com.mysql.jdbc.Driver for older connector versions
    dbtable='table1',
    user='root',
    password='****').load()
print(source_df)
source_df.show()
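The JDBC options used in these answers can also be collected in one place and passed with options(**opts). The sketch below assumes an existing SparkSession and a connector jar already on the classpath; mysql_jdbc_options and read_table are hypothetical helper names, and the credentials are placeholders.

```python
def mysql_jdbc_options(host, port, database, table, user, password):
    """Return keyword options for spark.read.format("jdbc").options(**opts)."""
    return {
        "url": "jdbc:mysql://{}:{}/{}".format(host, port, database),
        # Class name used by Connector/J 8.x; 5.x jars use com.mysql.jdbc.Driver.
        "driver": "com.mysql.cj.jdbc.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def read_table(spark, options):
    # spark is an existing SparkSession; the connector jar must be on its classpath.
    return spark.read.format("jdbc").options(**options).load()
```

For example, read_table(spark, mysql_jdbc_options("localhost", 3306, "database1", "table1", "root", "****")) would correspond to the read shown above.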
