What is the difference between sqlContext.read.load and sqlContext.read.text? - apache-spark

I am only trying to read a text file into a PySpark RDD, and I am noticing huge differences between sqlContext.read.load and sqlContext.read.text.
s3_single_file_inpath='s3a://bucket-name/file_name'
indata = sqlContext.read.load(s3_single_file_inpath, format='com.databricks.spark.csv', header='true', inferSchema='false', sep=',')
indata = sqlContext.read.text(s3_single_file_inpath)
The sqlContext.read.load command above fails with
Py4JJavaError: An error occurred while calling o227.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
But the second one succeeds?
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
It is not clear to me which of these to use when. Is there a clear distinction between them?

What is the difference between sqlContext.read.load and sqlContext.read.text?
sqlContext.read.load assumes parquet as the data source format while sqlContext.read.text assumes text format.
With sqlContext.read.load you can define the data source format using the format parameter.
Depending on your Spark version (1.6 vs. 2.x), you may or may not need to load an external Spark package to have support for the csv format.
As of Spark 2.0 you no longer have to load the spark-csv Spark package since (quoting the official documentation):
NOTE: This functionality has been inlined in Apache Spark 2.x. This package is in maintenance mode and we only accept critical bug fixes.
That would explain the confusion: you may have been using Spark 1.6.x without loading the Spark package that provides csv support.
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
https://spark.apache.org/docs/1.6.1/sql-programming-guide.html is for Spark 1.6.1, when the spark-csv Spark package was not part of Spark. That changed in Spark 2.0.
It is not clear to me which of these to use when. Is there a clear distinction between them?
There's none, actually, if you use Spark 2.x.
If however you use Spark 1.6.x, spark-csv has to be loaded separately using the --packages option (as described in Using with Spark shell):
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell
As a matter of fact, you can still use com.databricks.spark.csv format explicitly in Spark 2.x as it's recognized internally.
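For example (a minimal sketch assuming Spark 2.x, a SparkSession named spark, and the s3_single_file_inpath variable from the question), all of the following read the same CSV file:
# Spark 2.x: csv is built in, so these are interchangeable
df1 = spark.read.csv(s3_single_file_inpath, header=True)
df2 = spark.read.format('csv').option('header', 'true').load(s3_single_file_inpath)
# the old external-package name is still recognized and mapped to the built-in source
df3 = spark.read.format('com.databricks.spark.csv').option('header', 'true').load(s3_single_file_inpath)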

The difference is:
text is a built-in input format in Spark 1.6
com.databricks.spark.csv is a third party package in Spark 1.6
To use the third-party Spark CSV package (no longer needed in Spark 2.0) you have to follow the instructions on the spark-csv site, for example provide the
--packages com.databricks:spark-csv_2.10:1.5.0
argument with spark-submit / pyspark commands.
Beyond that, sqlContext.read.formatName(...) is syntactic sugar for sqlContext.read.format("formatName").load(...), which is in turn equivalent to sqlContext.read.load(..., format="formatName").
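Put together, a Spark 1.6 sketch (the package version matches the one above; sqlContext and s3_single_file_inpath are the ones from the question) would be:
# start the shell with the external package on the classpath:
#   pyspark --packages com.databricks:spark-csv_2.10:1.5.0
df_csv = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='false') \
    .load(s3_single_file_inpath)
# text is built in, so no extra package is needed
df_text = sqlContext.read.text(s3_single_file_inpath)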

Related

Why are raw strings not supported in Spark SQL when running in EMR?

In Pyspark, using a Spark SQL function such as regexp_extract or regexp_replace, raw strings (string literals prefixed with r) are supported when running locally, but not when running on EMR.
A simple example to reproduce the issue is:
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()
spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()
which will run successfully on Pyspark 3.3.0 locally but raises the parsing exception:
pyspark.sql.utils.ParseException:
Literals of type 'R' are currently not supported
when executed on EMR. Looking at the session configuration options for Spark SQL, there don't appear to be any options that would change how raw strings are parsed - the closest option is spark.sql.parser.quotedRegexColumnNames.
Anecdotally, I remember having a conversation with a colleague a few years ago who said something about AWS having an internal custom Spark build for EMR, but I have not found any documentation to corroborate that. Also, even if that were the case, I imagine they would maintain support for critical features like this.
There could also be a Spark configuration option that I missed during my investigation.
For anyone who may have some deeper insight or recognizes the issue, why does this discrepancy exist?
Thank you in advance!
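For what it's worth, a workaround that should sidestep the SQL parser entirely (a sketch only, not an explanation for the discrepancy) is to build the same expression through the DataFrame API, where the regex is an ordinary Python string and no r'...' SQL literal is ever parsed:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, regexp_replace

spark = SparkSession.builder.appName("test").getOrCreate()

# Python resolves the raw strings itself, so Spark's SQL parser never sees them;
# the pattern strips a backslash followed by three digits.
spark.range(1).select(
    regexp_replace(lit(r"ABC\123XYZ\456"), r"\\\d{3}", "").alias("new_value")
).show()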
Related posts:
SparkSQL Regular Expression: Cannot remove backslash from text (Developer tried a recommended solution for regex problem using raw strings, but failed with the parsing error on EMR - example code snippet based on this question)
regexp extract pyspark sql: ParseException Literals of type 'R' are currently not supported ("If I try this code with Pyspark locally it works (Pyspark version 3.3.0), but when I run this code in an EMR job it fails, I'm using emr-6.6.0 as application")

Understanding the jars in pyspark

I'm new to Spark and my understanding is this:
Jars are like a bundle of Java code files.
Each library that I install that internally uses Spark (or PySpark) has its own jar files that need to be available to both the driver and the executors in order for them to execute the package API calls that the user interacts with. These jar files are like the backend code for those API calls.
Questions:
Why are these jar files needed? Why could it not have sufficed to have all the code in Python? (I guess the answer is that Spark was originally written in Scala, and there it distributes its dependencies as jars. So, to avoid rebuilding that mountain of code, the Python libraries just call that Java code from the Python interpreter through some bridge. Please correct me if I have misunderstood.)
You specify these jar file locations while creating the Spark context via spark.driver.extraClassPath and spark.executor.extraClassPath. These are outdated parameters, though, I guess. What is the recent way to specify these jar file locations?
Where do I find these jars for each library that I install? For example, synapseml. What is the general idea about where the jar files for a package are located? Why don't the libraries make it clear where their specific jar files are going to be?
I understand I might not be making sense here; what I have mentioned above is partly just my hunch about how it must be happening.
So, can you please help me understand this whole business with jars and how to find and specify them?
Each library that I install that internally uses spark (or pyspark) has its own jar files
Can you tell which library you are trying to install?
Yes, external libraries can have jars even if you are writing code in Python.
Why ?
These libraries must be using some UDFs (user-defined functions). Spark runs the code in the Java runtime. If these UDFs are written in Python, there will be a lot of serialization and deserialization overhead from converting data into something readable by Python.
Java and Scala UDFs are usually faster, which is why some libraries ship with jars.
Why could it not have sufficed to have all the code in Python?
Same reason: Scala/Java UDFs are faster than Python UDFs.
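As a small illustration (the data and names are made up), compare a Python UDF, which ships every row to a Python worker and back, with the equivalent built-in JVM function:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([("spark",), ("jars",)], ["word"])

# Python UDF: each row is serialized to a Python worker and back
upper_py = F.udf(lambda s: s.upper(), StringType())
df.select(upper_py("word").alias("upper_py")).show()

# Built-in function: runs inside the JVM, no Python round trip
df.select(F.upper("word").alias("upper_jvm")).show()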
What is the recent way to specify these jar file locations?
You can use the spark.jars.packages property. It will make the jars available to both the driver and the executors.
Where do I find these jars for each library that I install? For example synapseml. What is the general idea about where the jar files for a package are located?
https://github.com/microsoft/SynapseML#python
They have mentioned there which jars are required, i.e. com.microsoft.azure:synapseml_2.12:0.9.4.
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:0.9.4") \
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
    .getOrCreate()
import synapse.ml
Can you try the above snippet?

Spark 2.3.3 outputing parquet to S3

A while back I had the problem that writing parquet files directly to S3 isn't really feasible, and I needed a caching layer before I finally copy the parquet files to S3 (see this post).
I know that HADOOP-13786 should fix this problem, and it seems to be implemented in Hadoop 3.1.0 and later.
Now the question is how to use it in Spark 2.3.3. As far as I understand, Spark 2.3.3 comes with Hadoop 2.8.5. I usually use flintrock to orchestrate my cluster on AWS. Is it just a matter of setting HDFS to 3.1.1 in the flintrock config, and then I get all the goodies? Or do I still, for example, have to set something in code like I did before? For example like this:
conf = SparkConf().setAppName(appname)\
    .setMaster(master)\
    .set('spark.executor.memory', '13g')\
    .set('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')\
    .set('fs.s3a.fast.upload', 'true')\
    .set('fs.s3a.fast.upload.buffer', 'disk')\
    .set('fs.s3a.buffer.dir', '/tmp/s3a')
(I know this is the old code and probably no longer relevant)
You'll need Hadoop 3.1, and a build of Spark 2.4 which has this PR applied: https://github.com/apache/spark/pull/24970
Some downstream products with their own Spark builds do this (HDP-3.1), but it's not (yet) in the apache builds.
With that you then need to configure Parquet to use the new bridging committer (Parquet only allows subclasses of the Parquet committer), and select which of the three S3A committers (long story) to use. The Staging committer is the one I'd recommend as it is (a) based on the one Netflix uses and (b) the one I've tested the most.
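For reference only, and assuming a Spark build that actually ships the hadoop-cloud classes mentioned above, selecting the directory flavour of the Staging committer for Parquet looks roughly like this in PySpark (property names taken from the S3A committer documentation; appname is the variable from your snippet):
from pyspark import SparkConf

conf = SparkConf().setAppName(appname) \
    .set('spark.hadoop.fs.s3a.committer.name', 'directory') \
    .set('spark.sql.sources.commitProtocolClass',
         'org.apache.spark.internal.io.cloud.PathOutputCommitProtocol') \
    .set('spark.sql.parquet.output.committer.class',
         'org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter')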
There's no fundamental reason why the same PR can't be applied to Spark 2.3, just that nobody has tried.

How to create custom writer for Spark Dataframe?

How can I create a custom write format for a Spark DataFrame so I can use it like df.write.format("com.mycompany.mydb").save()? I've tried reading through the Datastax Cassandra connector code but still couldn't figure it out.
Spark 3.0 completely changes the API. Some new interfaces, e.g. TableProvider and SupportsWrite, have been added.
You might find this guide helpful.
Using Spark's DataSourceV2.
If you are using a Spark version < 2.3, then you can use the Spark Data Source API V1.

What versions of avro and parquet formats does Spark support?

Does Spark 2.0 support avro and parquet files? What versions?
I have downloaded spark-avro_2.10-0.1.jar and got this error during load:
Name: java.lang.IncompatibleClassChangeError
Message: org.apache.spark.sql.sources.TableScan
StackTrace: at java.lang.ClassLoader.defineClassImpl(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:349)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:154)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:727)
at java.net.URLClassLoader.access$400(URLClassLoader.java:95)
at java.net.URLClassLoader$ClassFinder.run(URLClassLoader.java:1182)
at java.security.AccessController.doPrivileged(AccessController.java:686)
at java.net.URLClassLoader.findClass(URLClassLoader.java:602)
You are just using the wrong dependency. You should use the spark-avro dependency that is compiled with Scala 2.11. You can find it here.
As for parquet, it's supported without any dependency to add to your application.
Does Spark 2.0 support avro and parquet files?
Avro format is not supported in Spark 2.x out of the box. You have to use an external package, e.g. spark-avro.
Name: java.lang.IncompatibleClassChangeError
Message: org.apache.spark.sql.sources.TableScan
The reason for java.lang.IncompatibleClassChangeError is that you used spark-avro_2.10-0.1.jar, which was compiled for Scala 2.10, while Spark 2.0 uses Scala 2.11 by default. This inevitably leads to the IncompatibleClassChangeError.
You should rather load the spark-avro package using the --packages command line option (as described in the official documentation of spark-avro in With spark-shell or spark-submit):
$ bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
Using --packages ensures that this library and its dependencies will be added to the classpath. The --packages argument can also be used with bin/spark-submit.
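With the package on the classpath, loading avro is then just a matter of naming the format (a PySpark sketch; the paths are made up):
# started with: pyspark --packages com.databricks:spark-avro_2.11:3.2.0
events = spark.read.format("com.databricks.spark.avro").load("events.avro")
events.write.save("events_out")  # saved using the default format (parquet, see below)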
Parquet format is the default format when loading or saving datasets.
// loading parquet datasets (parquet is assumed; the paths are illustrative)
val mydataset = spark.read.load("path/to/input")
// saving in parquet format
mydataset.write.save("path/to/output")
You may want to read up on Parquet Files support in the official documentation:
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.
Parquet 1.8.2 is used (as you can see in Spark's pom.xml).

Resources