What versions of avro and parquet formats does Spark support? - apache-spark

Does Spark 2.0 support Avro and Parquet files? Which versions?
I downloaded spark-avro_2.10-0.1.jar and got this error when loading it:
Name: java.lang.IncompatibleClassChangeError
Message: org.apache.spark.sql.sources.TableScan
StackTrace: at java.lang.ClassLoader.defineClassImpl(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:349)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:154)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:727)
at java.net.URLClassLoader.access$400(URLClassLoader.java:95)
at java.net.URLClassLoader$ClassFinder.run(URLClassLoader.java:1182)
at java.security.AccessController.doPrivileged(AccessController.java:686)
at java.net.URLClassLoader.findClass(URLClassLoader.java:602)

You are just using the wrong dependency. You should use the spark-avro dependency that is compiled against Scala 2.11. You can find it here.
As for Parquet, it is supported out of the box; you don't need to add any dependency to your application.

Does Spark 2.0 support avro and parquet files?
Avro is not supported out of the box in Spark 2.x. You have to use an external package, e.g. spark-avro.
Name: java.lang.IncompatibleClassChangeError
Message: org.apache.spark.sql.sources.TableScan
The reason for java.lang.IncompatibleClassChangeError is that you used spark-avro_2.10-0.1.jar, which was compiled for Scala 2.10, while Spark 2.0 uses Scala 2.11 by default. Libraries built for different Scala binary versions are not binary compatible, hence the IncompatibleClassChangeError.
You should instead load the spark-avro package using the --packages command-line option (as described in the "With spark-shell or spark-submit" section of the official spark-avro documentation):
$ bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
Using --packages ensures that the library and its dependencies are added to the classpath. The --packages argument can also be used with bin/spark-submit.
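The naming convention behind the mismatch can be checked mechanically. The following is a minimal Python sketch, not part of Spark or spark-avro: the helper names scala_binary_version and packages_coordinate are hypothetical, and the jar-name pattern assumes the usual artifact_scalaVersion-version.jar layout.

```python
import re

# Assumed jar-name layout: <artifact>_<scalaBinaryVersion>-<version>.jar
JAR_NAME = re.compile(r"^(?P<artifact>.+)_(?P<scala>\d+\.\d+)-(?P<version>.+)\.jar$")


def scala_binary_version(jar_name):
    """Return the Scala binary version encoded in a jar name, or None."""
    match = JAR_NAME.match(jar_name)
    return match.group("scala") if match else None


def packages_coordinate(group, artifact, scala_binary, version):
    """Build a group:artifact_scala:version coordinate for --packages."""
    return "{}:{}_{}:{}".format(group, artifact, scala_binary, version)


spark_scala = "2.11"  # default Scala binary version of Spark 2.0, per the answer
jar = "spark-avro_2.10-0.1.jar"

if scala_binary_version(jar) != spark_scala:
    print("{} targets Scala {}, but Spark uses {}".format(
        jar, scala_binary_version(jar), spark_scala))

# Coordinate to pass to --packages, matching the command shown above
print(packages_coordinate("com.databricks", "spark-avro", spark_scala, "3.2.0"))
```

The second print produces the same coordinate as the spark-shell command in the answer; only the Scala suffix has to follow the Scala version your Spark build uses.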
Parquet format is the default format when loading or saving datasets.
// loading parquet datasets
spark.read.load
// saving in parquet format
mydataset.write.save
You may want to read up on Parquet Files support in the official documentation:
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.
The bundled Parquet version is pinned in Spark's pom.xml (1.8.2 at the time of writing).

Related

Spark3.2 write parquet files in spark2.3.1 format

Hello all, I am experiencing a new issue with a third-party reader.
I have written Parquet files with Spark 3.2, but these files can't be read by Dremio 20.4. Is there a flag or any other way in Spark 3.2 to write Parquet in the same format as Spark 2.3.1? Please let me know if you need more information.
I looked through the Spark 3.2 API but didn't find a flag that makes it write in the Spark 2.3.1 format. Otherwise I would have to write the Parquet files manually, which may need more re-engineering.

Can Spark load a proto definition file similar to loading Avro schema?

Spark supports loading Avro schema files with the spark-avro module and the "avroSchema" property: https://spark.apache.org/docs/latest/sql-data-sources-avro.html#data-source-option. This makes it easy to convert an Avro schema into a Spark StructType.
Is there similar support for loading a .proto file containing Protobuf message definitions, and converting them to a Spark StructType?

Spark AVRO compatible with BigQuery

I'm trying to create an external table in Hive and another in BigQuery using the same data, stored in Google Storage in Avro format and written with Spark.
I'm using a Dataproc cluster with Spark 2.2.0, spark-avro 4.0.0 and Hive 2.1.1.
There are some differences between Avro versions/packages, but if I create the table using Hive and then write the files using Spark, I'm able to see them in Hive.
BigQuery is different: it is able to read Hive Avro files but NOT Spark Avro files.
Error:
The Apache Avro library failed to parse the header with the follwing error: Invalid namespace: .someField
After searching a little for the error, the problem seems to be that Spark Avro files are different from Hive/BigQuery Avro files.
I don't know exactly how to fix this, maybe by using a different Avro package in Spark, but I haven't found one that is compatible with all the systems.
I would also like to avoid workarounds such as creating a temporary table in Hive and then creating another one with insert into ... select * from ..., because I'll be writing a lot of data.
Any help would be appreciated. Thanks
The error message is thrown by the C++ Avro library, which BigQuery uses. Hive probably uses the Java Avro library. The C++ library rejects a namespace that starts with ".".
This is the code from the library:
if (! ns_.empty() && (ns_[0] == '.' || ns_[ns_.size() - 1] == '.' || std::find_if(ns_.begin(), ns_.end(), invalidChar1) != ns_.end())) {
    throw Exception("Invalid namespace: " + ns_);
}
spark-avro has an additional option, recordNamespace, that sets the root namespace so that it no longer starts with ".".
https://github.com/databricks/spark-avro/blob/branch-4.0/README-for-old-spark-versions.md
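To see why an empty root namespace trips the check, here is a small Python sketch that mirrors only the leading and trailing dot tests from the C++ snippet. The invalidChar1 test is left out because its definition is not shown. The qualify helper and the com.example namespace are hypothetical, and joining the root namespace and the field name with a dot is an assumption about how the nested namespace is built.

```python
def is_valid_namespace(ns):
    """Mirror the dot checks from the C++ snippet.

    The invalidChar1 character test is omitted on purpose, because its
    definition is not part of the quoted code.
    """
    if not ns:
        return True  # an empty namespace passes the C++ check
    return not (ns[0] == "." or ns[-1] == ".")


def qualify(record_namespace, field_name):
    """Join a root namespace and a field name with a dot (assumed scheme)."""
    return "{}.{}".format(record_namespace, field_name)


# With an empty root namespace the result starts with "." and is rejected.
print(qualify("", "someField"), is_valid_namespace(qualify("", "someField")))

# With a non-empty root namespace (hypothetical value) the check passes.
print(qualify("com.example", "someField"),
      is_valid_namespace(qualify("com.example", "someField")))
```

Under that assumption, passing a non-empty recordNamespace when writing is what keeps the nested namespace from starting with a dot.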
Wondering if you ever found an answer to this.
I am seeing the same thing when trying to load data into a BigQuery table. The library first loads the data into GCS in Avro format. The schema has an array of structs as well, and the namespace begins with a ".".

What is the difference between sqlContext.read.load and sqlContext.read.text?

I am only trying to read a text file into a PySpark RDD, and I am noticing huge differences between sqlContext.read.load and sqlContext.read.text.
s3_single_file_inpath='s3a://bucket-name/file_name'
indata = sqlContext.read.load(s3_single_file_inpath, format='com.databricks.spark.csv', header='true', inferSchema='false',sep=',')
indata = sqlContext.read.text(s3_single_file_inpath)
The sqlContext.read.load command above fails with
Py4JJavaError: An error occurred while calling o227.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.csv. Please find packages at http://spark-packages.org
But the second one succeeds?
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
It is not clear to me which of these to use when. Is there a clear distinction between them?
What is the difference between sqlContext.read.load and sqlContext.read.text?
sqlContext.read.load assumes parquet as the data source format while sqlContext.read.text assumes text format.
With sqlContext.read.load you can define the data source format using format parameter.
Depending on the Spark version (1.6 vs 2.x), you may or may not need to load an external Spark package to get support for the csv format.
As of Spark 2.0 you no longer have to load spark-csv Spark package since (quoting the official documentation):
NOTE: This functionality has been inlined in Apache Spark 2.x. This package is in maintenance mode and we only accept critical bug fixes.
That would explain the error: you were probably using Spark 1.6.x without having loaded the Spark package that provides csv support.
Now, I am confused by this because all of the resources I see online say to use sqlContext.read.load including this one: https://spark.apache.org/docs/1.6.1/sql-programming-guide.html.
https://spark.apache.org/docs/1.6.1/sql-programming-guide.html is for Spark 1.6.1, when the spark-csv package was not yet part of Spark. It was inlined in Spark 2.0.
It is not clear to me which of these to use when. Is there a clear distinction between them?
There's none actually iff you use Spark 2.x.
If, however, you use Spark 1.6.x, spark-csv has to be loaded separately using the --packages option (as described in "Using with Spark shell"):
This package can be added to Spark using the --packages command line option. For example, to include it when starting the spark shell
As a matter of fact, you can still use com.databricks.spark.csv format explicitly in Spark 2.x as it's recognized internally.
The difference is:
text is a built-in input format in Spark 1.6
com.databricks.spark.csv is a third party package in Spark 1.6
To use third party Spark CSV (no longer needed in Spark 2.0) you have to follow the instructions on spark-csv site, for example provide
--packages com.databricks:spark-csv_2.10:1.5.0
argument with spark-submit / pyspark commands.
Beyond that, sqlContext.read.formatName(...) is syntactic sugar for sqlContext.read.format("formatName").load(...) and sqlContext.read.load(..., format=formatName).
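The version rule from the answers above can be written down as a tiny Python sketch. The function csv_submit_args is a hypothetical helper, not part of Spark; the package coordinate is the one quoted above, and the sketch only looks at the major version number.

```python
def csv_submit_args(spark_version):
    """Return the extra spark-submit / pyspark arguments needed for CSV support.

    Spark 2.x ships CSV support, so nothing is added; Spark 1.x needs the
    third-party spark-csv package quoted in the answer.
    """
    major = int(spark_version.split(".")[0])
    if major >= 2:
        return []
    return ["--packages", "com.databricks:spark-csv_2.10:1.5.0"]


for version in ("1.6.1", "2.0.0"):
    print(version, csv_submit_args(version))
```

The _2.10 suffix in the coordinate is the Scala binary version and would need to match the Scala version of the Spark 1.6 build in use.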

Cross Compiled jar file with scala version : Spark

I can't run my very first simple Spark program with Scala IDE.
I checked all my properties and I believe they are correct.
This is the link with the properties.
Any help?
The problem is that you are trying to include Scala 2.11.8 as a dependency in your application, while Spark artifacts rely on Scala 2.10.
You have two options to solve your problem:
Use Scala 2.10.x
Use Spark artifacts that rely on Scala 2.11 (e.g. spark-core_2.11 instead of spark-core_2.10)
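A quick way to spot this kind of conflict is to compare the Scala suffix of each dependency with the project's Scala version. The Python sketch below is illustrative only: mismatched_artifacts and binary_version are hypothetical helpers, and the dependency list is made up apart from the artifact names mentioned above.

```python
import re

# Cross-built Scala artifacts end in _<major>.<minor>, e.g. spark-core_2.10
SCALA_SUFFIX = re.compile(r"_(\d+\.\d+)$")


def binary_version(scala_version):
    """Reduce a full Scala version such as 2.11.8 to its binary version 2.11."""
    return ".".join(scala_version.split(".")[:2])


def mismatched_artifacts(scala_version, artifact_ids):
    """Return artifact IDs whose Scala suffix differs from the project's Scala."""
    expected = binary_version(scala_version)
    result = []
    for artifact in artifact_ids:
        match = SCALA_SUFFIX.search(artifact)
        if match and match.group(1) != expected:
            result.append(artifact)
    return result


# Illustrative dependency list; artifacts without a Scala suffix are ignored.
deps = ["spark-core_2.10", "spark-sql_2.11", "junit"]
print(mismatched_artifacts("2.11.8", deps))
```

Any artifact the function reports needs either the other Scala suffix or a project Scala version that matches it, which corresponds to the two options listed above.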
