spark.driver.extraClassPath doesn't work in virtual PySpark environment - apache-spark

I'm saving data to a Postgres database and the job failed with the following:
py4j.protocol.Py4JJavaError: An error occurred while calling
o186.jdbc. : java.lang.ClassNotFoundException: org.postgresql.Driver
The error went away once I downloaded the Postgres jar into the spark/jars folder, back when I had Spark installed globally.
I have since moved to a new machine and installed only pyspark in a virtual environment (venv) via pip.
I tried setting the extraClassPath config value to my jar folder inside the virtual directory but that didn't work:
session = SparkSession \
.builder \
.config("spark.driver.extraClassPath", "/home/me/source/acme/.venv/lib/python3.6/site-packages/pyspark/jars/postgresql-42.2.6.jar") \
.getOrCreate()
I have tried relative and absolute paths, as well as a wildcard (*) and the full filename. Nothing seems to work.
Setting spark.jars.packages, however, did correctly load the package from Maven:
.config('spark.jars.packages', 'org.postgresql:postgresql:42.2.6') \
How can I make the extraClassPath work?

You will also need to add the jar to the executor class path.
from pyspark.sql import SparkSession
session = SparkSession \
.builder \
.config("spark.driver.extraClassPath", "/home/me/source/acme/.venv/lib/python3.6/site-packages/pyspark/jars/postgresql-42.2.6.jar") \
.config("spark.executor.extraClassPath", "/home/me/source/acme/.venv/lib/python3.6/site-packages/pyspark/jars/postgresql-42.2.6.jar") \
.getOrCreate()
EDIT:
To semantically replicate spark.jars.packages, you can use spark.jars with an absolute path to the jar file. Also, just to be sure, check your jar and confirm it has a proper MANIFEST for the driver.
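A minimal sketch of how the settings above could be combined, assuming the jar exists at a local path on the machine that starts the driver. The helper names jdbc_jar_conf and build_session are made up for illustration; only the three Spark property names come from the discussion above.

```python
import os


def jdbc_jar_conf(jar_path):
    """Return Spark properties that expose one local jar to driver and executors."""
    abs_path = os.path.abspath(jar_path)  # the answer recommends an absolute path
    return {
        "spark.jars": abs_path,
        "spark.driver.extraClassPath": abs_path,
        "spark.executor.extraClassPath": abs_path,
    }


def build_session(jar_path):
    """Create a SparkSession with the jar settings applied (requires pyspark)."""
    # Imported lazily so jdbc_jar_conf can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in jdbc_jar_conf(jar_path).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Class path settings are read when the driver JVM starts, so a helper like build_session would need to run before any other code in the process creates a session; getOrCreate otherwise hands back the existing one.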

Related

df.show() prints empty result while in hdfs it is not empty

I have a PySpark application that is submitted to YARN with multiple nodes, and it reads Parquet from HDFS.
In my code, I have a dataframe which is read directly from HDFS:
df = self.spark.read.schema(self.schema).parquet("hdfs://path/to/file")
When I call df.show(n=2) right after the line above, it outputs:
+---------+--------------+-------+----+
|aaaaaaaaa|bbbbbbbbbbbbbb|ccccccc|dddd|
+---------+--------------+-------+----+
+---------+--------------+-------+----+
But when I manually go to the HDFS path, the data is not empty.
What have I tried?
1- At first I thought I might have given my executors and driver too few cores and too little memory, so I doubled them and nothing changed.
2- Then I thought the path might be wrong, so I gave it a wrong HDFS path and it threw an error saying the path does not exist.
What am I assuming?
1- I think this may have something to do with drivers and executors
2- It may have something to do with YARN
3- The configs provided when using spark-submit
current config:
spark-submit \
--master yarn \
--queue my_queue_name \
--deploy-mode cluster \
--jars some_jars \
--conf spark.yarn.dist.files=some_files \
--conf spark.sql.catalogImplementation=in-memory \
--properties-file some_zip_file \
--py-files some_py_files \
main.py
What I am sure of:
The data is not empty. The same HDFS path is used in another project, which works fine.
So the problem was with the jar files I was providing.
The Hadoop version was 2.7.2; I changed it to 3.2.0 and it's working fine.
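A rough sketch of how such a mismatch could be spotted before submitting, assuming the jars passed via --jars follow the usual hadoop-<module>-<version>.jar naming. The function names are hypothetical, and the cluster's Hadoop version has to be supplied by you; nothing here queries the cluster.

```python
import os
import re

# Matches names such as hadoop-common-2.7.2.jar or hadoop-hdfs-client-3.2.0.jar.
_HADOOP_JAR = re.compile(r"^(hadoop-[a-z-]+?)-(\d+\.\d+\.\d+)\.jar$")


def hadoop_versions(jar_paths):
    """Map each Hadoop version found in the jar file names to the matching jars."""
    found = {}
    for path in jar_paths:
        match = _HADOOP_JAR.match(os.path.basename(path))
        if match:
            found.setdefault(match.group(2), []).append(path)
    return found


def mismatched_jars(jar_paths, cluster_version):
    """Return the Hadoop jars whose version differs from the given cluster version."""
    return sorted(
        path
        for version, paths in hadoop_versions(jar_paths).items()
        if version != cluster_version
        for path in paths
    )
```

Jars that do not look like Hadoop artifacts are ignored, so the check can be run over the whole --jars list.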

How to access external property file in spark-submit job?

I am using Spark 2.4.1 and Java 8.
I am trying to load an external property file while submitting my Spark job using spark-submit.
I am using the Typesafe Config library below to load my property file.
<groupId>com.typesafe</groupId>
<artifactId>config</artifactId>
<version>1.3.1</version>
In my code I am using
public static Config loadEnvProperties(String environment) {
Config appConf = ConfigFactory.load(); // loads "application.properties" from my "resources" folder
return appConf.getConfig(environment);
}
To externalize this "application.properties" file, I tried the following spark-submit command, as suggested by an expert:
spark-submit \
--master yarn \
--deploy-mode cluster \
--name Extractor \
--jars "/local/apps/jars/*.jar" \
--files /local/apps/log4j.properties \
--files /local/apps/applicationNew.properties \
--class Driver \
--conf spark.driver.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
--conf spark.executor.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
--conf spark.driver.extraJavaOptions=-Dlog4j.debug \
--conf spark.driver.extraClassPath=. \
migration-0.0.1.jar sit
I placed the "log4j.properties" and "applicationNew.properties" files in the same folder where I run spark-submit.
1) In the above shell script, if I keep
--files /local/apps/log4j.properties, /local/apps/applicationNew.properties \
Error :
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/local/apps//applicationNew.properties
at org.apache.spark.deploy.SparkSubmitArguments.error(SparkSubmitArguments.scala:657)
So what is wrong here?
2) Then I changed the above script as shown, i.e.
--files /local/apps/log4j.properties \
--files /local/apps/applicationNew.properties \
When I run the Spark job, I get the following error:
19/08/02 14:19:09 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit'
at com.typesafe.config.impl.SimpleConfig.findKeyOrNull(SimpleConfig.java:152)
So what is wrong here? Why is the applicationNew.properties file not being loaded?
3) When I debugged it by printing "config.file" as below
String ss = System.getProperty("config.file");
logger.error ("config.file : {}" , ss);
Error :
19/08/02 14:19:09 ERROR Driver: config.file : null
19/08/02 14:19:09 ERROR yarn.ApplicationMaster: User class threw exception: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'sit'
So how do I set the "config.file" option from spark-submit?
How do I fix the above errors and load properties from the external applicationNew.properties file?
The proper way to list files for the --files, --jars and other similar arguments is via a comma without any spaces (this is a crucial thing, and you see the exception about invalid main class precisely because of this):
--files /local/apps/log4j.properties,/local/apps/applicationNew.properties
If the file names themselves contain spaces, you should use quotes to escape these spaces:
--files "/some/path with/spaces.properties,/another path with/spaces.properties"
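If spark-submit is launched from a script, the comma rule can be enforced in one place. This is only a sketch: files_argument and spark_submit_files_args are hypothetical helper names, and rejecting paths that contain a comma is an assumption made to keep the joined value unambiguous.

```python
def files_argument(paths):
    """Join local paths into the single comma-separated value that --files expects."""
    cleaned = [p.strip() for p in paths]  # stray spaces around a path break the list
    for p in cleaned:
        if "," in p:
            raise ValueError(f"path contains a comma and cannot be listed: {p!r}")
    return ",".join(cleaned)


def spark_submit_files_args(paths):
    """Return the argv fragment for spark-submit, e.g. to pass to subprocess.run."""
    # As a single list element the value needs no shell quoting, even with spaces.
    return ["--files", files_argument(paths)]
```

Passing the result as part of an argument list (rather than a shell string) sidesteps the quoting question for paths with spaces.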
Another issue is that you specify the same property twice:
...
--conf spark.driver.extraJavaOptions=-Dconfig.file=./applicationNew.properties \
...
--conf spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties \
...
There is no way for spark-submit to know how to merge these values, so only one of them is used. This is why you see null for the config.file system property: the second --conf argument takes priority and overrides the extraJavaOptions property with just the path to the log4j config file. Thus, the correct way is to specify all these values as one property:
--conf spark.driver.extraJavaOptions="-Dlog4j.configuration=file:./log4j.properties -Dconfig.file=./applicationNew.properties"
Note that because of quotes, the entire spark.driver.extraJavaOptions="..." is one command line argument rather than several, which is very important for spark-submit to pass these arguments to the driver/executor JVM correctly.
(I also changed the log4j.properties value to a proper URI instead of a bare file name. I recall that it might not work unless this path is a URI, but you can try either way and check for sure.)
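The "one property, all options" rule could be captured in a small helper when the command line is assembled programmatically. This is a sketch under that assumption; java_options_conf is a hypothetical name, and the option names in the usage lines are the ones from the question.

```python
def java_options_conf(role, options):
    """Build one --conf argument carrying every -D option for the driver or executor.

    role is "driver" or "executor"; options maps system property names to values.
    """
    if role not in ("driver", "executor"):
        raise ValueError("role must be 'driver' or 'executor'")
    # All options go into a single space-separated value so none overrides another.
    flags = " ".join(f"-D{name}={value}" for name, value in options.items())
    return ["--conf", f"spark.{role}.extraJavaOptions={flags}"]


driver_args = java_options_conf(
    "driver",
    {
        "log4j.configuration": "file:./log4j.properties",
        "config.file": "./applicationNew.properties",
    },
)
```

Here driver_args is a two-element list, so the whole extraJavaOptions value stays one command-line argument without any shell quoting.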
--files and SparkFiles.get
With --files you should access the resource using SparkFiles.get as follows:
$ ./bin/spark-shell --files README.md
scala> import org.apache.spark._
import org.apache.spark._
scala> SparkFiles.get("README.md")
res0: String = /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8w0000gn/T/spark-f0b16df1-fba6-4462-b956-fc14ee6c675a/userFiles-eef6d900-cd79-4364-a4a2-dd177b4841d2/README.md
In other words, Spark will distribute the --files to executors, but the only way to know the path of the files is to use the SparkFiles utility.
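The same lookup is available from PySpark through pyspark.SparkFiles. The sketch below assumes a plain key=value properties file; parse_properties is a deliberately simple, hypothetical stand-in and not a replacement for Typesafe Config.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def load_distributed_properties(file_name):
    """Read a file shipped with --files from wherever Spark placed it (requires pyspark)."""
    from pyspark import SparkFiles  # lazy import; only meaningful inside a Spark job

    with open(SparkFiles.get(file_name), encoding="utf-8") as handle:
        return parse_properties(handle.read())
```

Inside a running job, something like load_distributed_properties("applicationNew.properties") would then take the bare file name that was passed to --files.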
getResourceAsStream(resourceFile) and InputStream
The other option would be to package all resource files into a jar file and bundle it together with the other jar files (either as a single uber-jar or simply as part of CLASSPATH of the Spark app) and use the following trick:
this.getClass.getClassLoader.getResourceAsStream(resourceFile)
With that, regardless of the jar file the resourceFile is in, as long as it's on the CLASSPATH, it should be available to the application.
I'm pretty sure any decent framework or library that uses resource files for configuration, e.g. Typesafe Config, accepts InputStream as the way to read resource files.
You could also include the --files as part of a jar file that is part of the CLASSPATH of the executors, but that'd be obviously less flexible (as every time you'd like to submit your Spark app with a different file, you'd have to recreate the jar).

Does PySpark need a local Spark installation?

I'm trying to get going with spark. Trying to create a simple SQL connection to a database while running Spark in a docker container.
I do not have Spark installed on my laptop. Only inside my docker container.
I got the following code on my laptop:
# The master below is the Docker container running the master and worker.
spark = SparkSession \
.builder \
.master("spark://localhost:7077") \
.appName("sparktest") \
.getOrCreate()
jdbcDF = spark.read.format("jdbc") \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.option("url", "jdbc:sqlserver://xxx") \
.option("dbtable", "xxx") \
.option("user", "xxx") \
.option("password", "xxx").load()
I can't get it to work.
I either get java.sql.SQLException: No suitable driver or ClassNotFoundException from Java.
I've moved the files to the container and everything seems fine over there.
I've made sure the mssql jar files are on the SPARK_CLASSPATH on both driver and executor.
Am I supposed to have Spark installed locally for me to use PySpark against the remote master running in my docker container?
It looks like it's trying to find the SQL driver on my laptop?
Everything is fine if I run the code using spark-submit from inside the Docker container.
I was trying to avoid hosting Jupyter inside the Docker container, and was hoping not to have to install Spark on my Windows laptop, keeping it in my Linux container instead.
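I faced this before. As a solution, you can download the JDBC driver and set the driver configuration manually by giving the JDBC driver path:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
conf.set('spark.jars', '/PATH_OF_DRIVER/driver.jar')
conf.set('spark.executor.extraClassPath', '/PATH_OF_DRIVER/driver.jar')
# Pass the conf to the builder so the settings are actually applied.
spark = SparkSession.builder.config(conf=conf).getOrCreate()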

How do I connect Spark to JDBC driver in Zeppelin?

I am trying to pull in data from a SQL server to a Hive table using Spark in a Zeppelin notebook.
I am trying to run the following code:
%pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import *
spark = SparkSession.builder \
.appName('sample') \
.getOrCreate()
#set url, table, etc.
df = spark.read.format('jdbc') \
.option('url', url) \
.option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
.option('dbtable', table) \
.option('user', user) \
.option('password', password) \
.load()
However, I keep getting the exception:
...
Py4JJavaError: An error occurred while calling o81.load.
: java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
...
I have been trying to figure this out all day and I believe something is wrong with how I am trying to set up the driver. I have a driver under /tmp/sqljdbc42.jar on the instance. Can you please explain how I can let Spark know where this driver is? I have tried many different ways both through the shell and through the interpreter editor.
Thanks!
EDIT
I should also note that I loaded the jar onto my instance through Zeppelin's shell (%sh) using
curl -o /tmp/sqljdbc42.jar http://central.maven.org/maven2/com/microsoft/sqlserver/mssql-jdbc/6.4.0.jre8/mssql-jdbc-6.4.0.jre8.jar
pyspark --driver-class-path /tmp/sqljdbc42.jar --jars /tmp/sqljdbc42.jar
Here is how I fixed this:
scp the driver jar onto the cluster driver node.
Go to the Zeppelin interpreter page, scroll to the Spark section, then click edit.
Write the complete path to the jar under artifacts, e.g. /home/Hadoop/mssql-jdbc.jar, and nothing else.
Click save.
Then you should be good!
You can add it through the Web UI in the Interpreter settings as follows:
Click Interpreter in menu
Click 'edit' button in the Spark interpreter
Add the path for the jar in the artifact field
Then just save and restart interpreter.
Similar to Tomas, you can add the driver (or any library) using maven in the interpreter:
Click Interpreter in menu
Click 'edit' button in the Spark interpreter
In the artifact field, add the Maven coordinates (groupId:artifactId:version) instead of a jar path
For example, in your case, you can use com.microsoft.sqlserver:mssql-jdbc:jar:8.4.1.jre8 in artifact field.
When you restart the interpreter, it will download and add the dependency for you.

Pyspark - Load file: Path does not exist

I am a newbie to Spark. I'm trying to read a local csv file within an EMR cluster. The file is located in: /home/hadoop/. The script that I'm using is this one:
spark = SparkSession \
.builder \
.appName("Protob Conversion to Parquet") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
df = spark.read.csv('/home/hadoop/observations_temp.csv', header=True)
When I run the script, it raises the following error message:
pyspark.sql.utils.AnalysisException: u'Path does not exist:
hdfs://ip-172-31-39-54.eu-west-1.compute.internal:8020/home/hadoop/observations_temp.csv
Then, I found out that I have to add file:// in the file path so it can read the file locally:
df = spark.read.csv('file:///home/hadoop/observations_temp.csv', header=True)
But this time, the above approach raised a different error:
Lost task 0.3 in stage 0.0 (TID 3,
ip-172-31-41-81.eu-west-1.compute.internal, executor 1):
java.io.FileNotFoundException: File
file:/home/hadoop/observations_temp.csv does not exist
I think this is because the file:// scheme just reads the file locally and does not distribute it across the other nodes.
Do you know how I can read the csv file and make it available to all the other nodes?
You are right that the file is missing from your worker nodes, which raises the error you got.
The official documentation (see External Datasets) says:
If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
So basically you have two solutions:
Copy your file to each worker before starting the job;
Or upload it to HDFS (the recommended solution) with something like:
hadoop fs -put localfile /user/hadoop/hadoopfile.csv
Now you can read it with :
df = spark.read.csv('/user/hadoop/hadoopfile.csv', header=True)
It seems that you are also using AWS S3. You can always try to read it directly from S3 without downloading it. (with the proper credentials of course)
Some suggest that the --files option of spark-submit uploads the files to the execution directories. I don't recommend this approach unless your csv file is very small, but then you wouldn't need Spark.
Instead, I would stick with HDFS (or any distributed file system).
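The two error messages in the question come from how a path is resolved: without a scheme it is looked up on the cluster's default filesystem (HDFS here), and with file:// it is looked up on each node's local disk. The sketch below only mimics that rule for absolute paths; qualify is a hypothetical helper, and the default filesystem address in the usage line is a placeholder, not taken from any real cluster.

```python
from urllib.parse import urlparse


def qualify(path, default_fs):
    """Mimic how an absolute, scheme-less path resolves against the default filesystem."""
    if urlparse(path).scheme:
        return path  # file://, hdfs://, s3:// and similar are used as given
    if not path.startswith("/"):
        raise ValueError("only absolute paths are handled in this sketch")
    return default_fs.rstrip("/") + path


# Placeholder default filesystem; a real value comes from the cluster configuration.
resolved = qualify("/user/hadoop/hadoopfile.csv", "hdfs://namenode:8020")
```

This is why the path uploaded with hadoop fs -put can be read without any scheme, while the file under /home/hadoop cannot.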
I think what you are missing is explicitly setting the master while initializing the SparkSession. Try something like this:
spark = SparkSession \
.builder \
.master("local") \
.appName("Protob Conversion to Parquet") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
and then read the file in the same way you have been doing
df = spark.read.csv('file:///home/hadoop/observations_temp.csv')
this should solve the problem...
Might be useful for someone running zeppelin on mac using Docker.
Copy files to custom folder : /Users/my_user/zeppspark/myjson.txt
docker run -p 8080:8080 -v /Users/my_user/zeppspark:/zeppelin/notebook --rm --name zeppelin apache/zeppelin:0.9.0
On Zeppelin you can run this to get your file:
%pyspark
json_data = sc.textFile('/zeppelin/notebook/myjson.txt')

Resources