Understanding the jars in pyspark - apache-spark

I'm new to spark and my understanding is this:
jars are like a bundle of compiled Java code files
Each library that I install that internally uses spark (or pyspark) has its own jar files that need to be available to both the driver and the executors in order for them to execute the package API calls that the user interacts with. These jar files are like the backend code for those API calls.
Questions:
Why are these jar files needed? Why could it not have sufficed to have all the code in Python? (I guess the answer is that Spark was originally written in Scala, and there it distributes its dependencies as jars. So, to avoid recreating that mountain of code, the Python libraries just call that Java/Scala code from the Python interpreter through some bridge. Please correct me if I have not understood this right.)
You specify the locations of these jar files while creating the Spark context via spark.driver.extraClassPath and spark.executor.extraClassPath, though I guess these are outdated parameters. What is the current way to specify these jar file locations?
Where do I find these jars for each library that I install, for example synapseml? What is the general idea about where the jar files for a package are located? Why do the libraries not make it clear where their specific jar files are going to be?
I understand I might not be making complete sense here; what I have mentioned above is partly just my hunch that this is how it must be happening.
So, can you please help me understand this whole business with jars and how to find and specify them?

Each library that I install that internally uses spark (or pyspark)
has its own jar files
Can you tell us which library you are trying to install?
Yes, external libraries can have jars even if you are writing your code in Python.
Why?
These libraries usually use UDFs (User Defined Functions). Spark runs the code in the Java runtime. If these UDFs are written in Python, a lot of time is spent on serialization and deserialization, converting the data into something readable by the Python process.
Java and Scala UDFs are usually faster, which is why some libraries ship with jars.
Why could it not have sufficed to have all the code in python?
Same reason: Scala/Java UDFs are faster than Python UDFs.
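To make that overhead concrete, here is a minimal hedged sketch (the session, column names, and sample data are made up for illustration) contrasting a Python UDF, where each row is serialized out to a Python worker and back, with the equivalent built-in function that runs entirely inside the JVM:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-overhead-demo").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: rows are pickled, shipped to a Python worker process, transformed,
# and shipped back -- that round trip is the serialization cost.
upper_py = F.udf(lambda s: s.upper(), StringType())
df.select(upper_py("name")).show()

# Built-in function: executes inside the JVM, no Python round trip.
df.select(F.upper("name")).show()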
What is the current way to specify these jar file locations?
You can use the spark.jars.packages property. The resolved jars are copied to both the driver and the executors.
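As a rough illustration (the jar path and Maven coordinate below are placeholders, not real artifacts): spark.jars points directly at jar files, while spark.jars.packages takes Maven coordinates and resolves the jars for you; either way they are distributed to the driver and the executors.
from pyspark.sql import SparkSession

# Placeholder jar path and Maven coordinate; either setting ships the jars
# to both the driver and the executors.
spark = SparkSession.builder.appName("jar-config-example") \
    .config("spark.jars", "/path/to/some-library.jar") \
    .config("spark.jars.packages", "org.example:some-library_2.12:1.0.0") \
    .getOrCreate()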
Where do I find these jars for each library that I install, for example synapseml? What is the general idea about where the jar files for a package are located?
https://github.com/microsoft/SynapseML#python
They have mentioned there which jars are required, i.e. com.microsoft.azure:synapseml_2.12:0.9.4
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.4") \
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
    .getOrCreate()
import synapse.ml
Can you try the above snippet?

Related

Add `hadoop-cloud` to Spark's classpath

Since the recent announcement of S3 strong consistency on reads and writes, I would like to try the new S3A committers, such as the magic one.
According to the Spark documentation, we need to add two classes to the classpath: BindingParquetOutputCommitter and PathOutputCommitProtocol, added in this commit.
The official documentation suggests using Spark built with the hadoop-3.2 profile. Is there any way to add the two classes without recompiling Spark? (I cannot use an already-built Spark distribution for some technical reasons.)
I am using Spark 3.0.1
I already checked this answer, but unfortunately the OP switched from the open-source S3A committers to the one provided by EMR.
You need a version of Spark built with the -Phadoop-cloud profile, which adds the new classes into spark-hadoop-cloud.jar and pulls in the relevant dependencies, which for S3A are:
hadoop-aws-${the-exact-version-of-hadoop-jars-you-have}.jar
aws-sdk-something-${the-exact-version-that-hadoop-jar-was-built-with}.jar
So you could check out the Spark branch you use and do a Maven build of only that module:
mvn -pl hadoop-cloud -Phadoop-cloud -Dhadoop.version=$hadoop-version install -DskipTests
That gives you a new spark-hadoop-cloud JAR which you can use with the new stuff.
Note that the S3A committers only came in with Hadoop 3.1.
We (I) have been busy fielding some race conditions with job IDs and the "staging committer",
and, given S3 is now consistent, I'd recommend the magic committer.
You can test this in Spark standalone: just run some minimal job to write data and verify that the _SUCCESS file contains a JSON summary of the job.
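As an illustrative sketch of such a minimal test (assuming the rebuilt spark-hadoop-cloud JAR plus the matching hadoop-aws and AWS SDK jars are already on the classpath; the bucket path is a placeholder), the magic committer can be wired up purely through configuration:
from pyspark.sql import SparkSession

# Committer settings from the S3A / cloud-integration docs; the bucket is hypothetical.
spark = SparkSession.builder.appName("magic-committer-smoke-test") \
    .config("spark.hadoop.fs.s3a.committer.name", "magic") \
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true") \
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol") \
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter") \
    .getOrCreate()

# Minimal write job; afterwards check that s3a://your-bucket/committer-test/_SUCCESS
# contains a JSON summary rather than being a zero-byte marker.
spark.range(1000).write.mode("overwrite").parquet("s3a://your-bucket/committer-test")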
Whichever committer you use, make sure your buckets are set up to delete incomplete uploads after a few days. You should do that everywhere anyway.
HTH
If you are building with stevel's mvn command, you should include the -Phadoop-3.2 flag as well, so that the extra source directory is picked up in pom.xml and the committer classes are compiled into the JAR. So the full command would be mvn -pl hadoop-cloud -Phadoop-cloud -Phadoop-3.2 -Dhadoop.version=$hadoop-version install -DskipTests. See https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/hadoop-cloud/pom.xml#L207

Creating a UDF in spark

I am trying to create a permanent function in spark using geomesa-spark-jts.
Geomesa-spark-jts has huge potential in the larger LocationTech community.
I started by downloading geomesa-spark-jts, which contains the following:
After that I launched Spark like this (I made sure that the jar is on the path):
Now when I use ST_Translate, which comes with that package, it does give me a result:
But the problem is that when I try to define ST_Translate as a UDF, I get the following error:
The functions you mentioned are already supported in GeoMesa 2.0.0 for Spark 2.2.0. http://www.geomesa.org/documentation/user/spark/sparksql_functions.html
The geomesa-accumulo-spark-runtime jar is a shaded jar that includes the code from geomesa-spark-jts. You might be hitting issues with having the classes defined in two different jars.
In order to use st_translate with Hive, I believe you would have to implement a new class that extends org.apache.hadoop.hive.ql.exec.UDF and invokes the GeoMesa function.

What should be the input to setJars() method in Spark Streaming

val conf = new SparkConf(true)
.setAppName("Streaming Example")
.setMaster("spark://127.0.0.1:7077")
.set("spark.cassandra.connection.host","127.0.0.1")
.set("spark.cleaner.ttl","3600")
.setJars(Array("your-app.jar"))
Let's say I am creating a Spark Streaming application.
What should the content of the "your-app.jar" file be? Do I have to create it manually in my local file system and pass the path, or is that a Scala file compiled using sbt?
If that's a Scala file, please help me write the code.
Since I am a beginner, I am just trying to run some sample code.
The setJars method of the SparkConf class takes external JARs that need to be distributed on the cluster, e.g. any external drivers like JDBC, etc.
You do not have to pass your own application JAR in this if that's what you are asking.
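For reference, a hedged PySpark sketch of the same idea (the driver jar path is a placeholder): setJars corresponds to the spark.jars setting, which distributes the listed JARs to the executors.
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Hypothetical path to an external driver jar; spark.jars ships it to the
# executors, which is what setJars does in the Scala API.
conf = SparkConf() \
    .setAppName("Streaming Example") \
    .set("spark.jars", "/path/to/external-jdbc-driver.jar")

spark = SparkSession.builder.config(conf=conf).getOrCreate()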

What is the difference between sqlContext.read.load and sqlContext.read.text?

I am only trying to read a text file into a pyspark RDD, and I am noticing huge differences between sqlContext.read.load and sqlContext.read.text.
s3_single_file_inpath='s3a://bucket-name/file_name'
indata = sqlContext.read.load(s3_single_file_inpath, format='com.databricks.spark.csv', header='true', inferSchema='false',sep=',')
indata = sqlContext.read.text(s3_single_file_inpath)
The sqlContext.read.load command above fails with
Py4JJavaError: An error occurred while calling o227.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
But the second one succeeds?
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
It is not clear to me which of these to use when. Is there a clear distinction between them?
What is the difference between sqlContext.read.load and sqlContext.read.text?
sqlContext.read.load assumes parquet as the data source format while sqlContext.read.text assumes text format.
With sqlContext.read.load you can define the data source format using the format parameter.
Depending on the version of Spark (1.6 vs 2.x) you may or may not need to load an external Spark package to have support for the csv format.
As of Spark 2.0 you no longer have to load the spark-csv Spark package since (quoting the official documentation):
NOTE: This functionality has been inlined in Apache Spark 2.x. This package is in maintenance mode and we only accept critical bug fixes.
That would explain why you got confused: you may have been using Spark 1.6.x and not have loaded the Spark package that provides csv support.
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
https://spark.apache.org/docs/1.6.1/sql-programming-guide.html is for Spark 1.6.1, when the spark-csv Spark package was not part of Spark. That happened in Spark 2.0.
It is not clear to me which of these to use when. Is there a clear distinction between them?
There is none, actually, if you use Spark 2.x.
If however you use Spark 1.6.x, spark-csv has to be loaded separately using the --packages option (as described in Using with Spark shell):
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell
As a matter of fact, you can still use the com.databricks.spark.csv format explicitly in Spark 2.x, as it is recognized internally.
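For example, a quick hedged sketch (assuming a Spark 2.x session named spark, as in spark-shell or pyspark; the path is a placeholder) showing the old format name still resolving:
# Spark 2.x maps the old external format name to the built-in csv source.
df = spark.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load("s3a://bucket-name/file_name")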
The difference is:
text is a built-in input format in Spark 1.6
com.databricks.spark.csv is a third party package in Spark 1.6
To use the third-party Spark CSV package (no longer needed in Spark 2.0) you have to follow the instructions on the spark-csv site, for example provide the
--packages com.databricks:spark-csv_2.10:1.5.0
argument with the spark-submit / pyspark commands.
Beyond that, sqlContext.read.formatName(...) is syntactic sugar for sqlContext.read.format("formatName").load(...) and sqlContext.read.load(..., format="formatName").
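As an illustration (a hedged sketch with a placeholder path, using a Spark 2.x session where the csv source is built in), the following three reads are equivalent:
# All three read the same CSV file; the s3a path is a placeholder.
df1 = spark.read.csv("s3a://bucket-name/file_name", header=True)
df2 = spark.read.format("csv").option("header", "true").load("s3a://bucket-name/file_name")
df3 = spark.read.load("s3a://bucket-name/file_name", format="csv", header="true")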

Cross-compiled jar file with Scala version: Spark

I can't run my very first simple Spark program with Scala IDE.
I checked all my properties and I believe they are correct.
This is the link with the properties.
Any help?
The problem is that you are trying to include Scala 2.11.8 as a dependency in your application, while Spark artifacts rely on Scala 2.10.
You have two options to solve your problem:
Use Scala 2.10.x
Use Spark artifacts that rely on Scala 2.11 (e.g. spark-core_2.11 instead of spark-core_2.10)
