Can Spark load a proto definition file similar to loading Avro schema? - apache-spark

Spark supports supplying an Avro schema to the spark-avro module via the "avroSchema" data source option: https://spark.apache.org/docs/latest/sql-data-sources-avro.html#data-source-option. This makes it easy to convert an Avro schema into a Spark StructType.
Is there similar support for loading a .proto file containing Protobuf message definitions, and converting them to a Spark StructType?
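For reference, a minimal sketch of the Avro side, assuming Spark's spark-avro module (Spark 2.4+) is on the classpath; the schema and data paths are placeholders:
import java.io.File
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters

// Parse an .avsc file and convert it to a Spark SQL type
// (for a record schema this is a StructType)
val avroSchema = new Schema.Parser().parse(new File("/path/to/schema.avsc"))
val sparkType = SchemaConverters.toSqlType(avroSchema).dataType

// The same schema JSON can be passed via the avroSchema option when reading
val df = spark.read
  .format("avro")
  .option("avroSchema", avroSchema.toString)
  .load("/path/to/data.avro")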

Related

Is it possible to save Avro serialized records in Parquet format?

I'm writing a job that serializes some records to Avro (using org.apache.spark.sql.avro.functions.to_avro) and then saves them in Parquet format, but I'm getting errors about malformed records when reading them back from Parquet.
When I store those Avro-serialized records in Kafka and then use from_avro to read them, there is no problem.
I suspect it's because the Avro-serialized records should be saved in Avro format instead, but I'm not 100% sure about this.
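A minimal sketch of the setup described above, with df, the paths, and avroSchemaJson (the writer schema as a JSON string) as placeholders:
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.avro.functions.{from_avro, to_avro}

// Serialize all columns into a single Avro binary column and store it in Parquet
val serialized = df.select(to_avro(struct(col("*"))).as("value"))
serialized.write.mode("overwrite").parquet("/tmp/avro-in-parquet")

// The Parquet file just holds a binary column; from_avro with the writer's
// schema is still needed to decode the payload after reading it back
val decoded = spark.read.parquet("/tmp/avro-in-parquet")
  .select(from_avro(col("value"), avroSchemaJson).as("record"))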

Spark AVRO compatible with BigQuery

I'm trying to create an external table in Hive and another in BigQuery using the same data, stored in Google Storage in Avro format and written with Spark.
I'm using a Dataproc cluster with Spark 2.2.0, Spark-avro 4.0.0 and Hive 2.1.1
There are some differences between Avro versions/packages, but if I create the table using Hive and then write the files using Spark, I'm able to see them in Hive.
BigQuery is a different story: it is able to read Hive Avro files but NOT Spark Avro files.
Error:
The Apache Avro library failed to parse the header with the follwing error: Invalid namespace: .someField
Searching a little about the error, it seems the problem is that Spark Avro files are different from Hive/BigQuery Avro files.
I don't know exactly how to fix this; maybe by using a different Avro package in Spark, but I haven't found one that is compatible with all the systems.
I would also like to avoid tricky workarounds like creating a temporary table in Hive and populating another one with insert into ... select * from ...; I'll be writing a lot of data and want to avoid that kind of solution.
Any help would be appreciated. Thanks
The error message is thrown by the C++ Avro library, which BigQuery uses. Hive probably uses the Java Avro library. The C++ library doesn't allow a namespace that starts with ".".
This is the code from the library:
if (! ns_.empty() && (ns_[0] == '.' || ns_[ns_.size() - 1] == '.' || std::find_if(ns_.begin(), ns_.end(), invalidChar1) != ns_.end())) {
    throw Exception("Invalid namespace: " + ns_);
}
Spark-avro has an additional option, recordNamespace, to set the root namespace, so the generated namespace will not start with ".":
https://github.com/databricks/spark-avro/blob/branch-4.0/README-for-old-spark-versions.md
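A minimal sketch of setting that option on write, assuming the Databricks spark-avro package; the record name, namespace, and path are placeholders:
// Write Avro with an explicit root record name/namespace so the
// namespace in the Avro header does not start with "."
df.write
  .format("com.databricks.spark.avro")
  .option("recordName", "MyRecord")
  .option("recordNamespace", "com.example.avro")
  .save("gs://some-bucket/path")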
Wondering if you ever found an answer to this.
I am seeing the same thing, where I am trying to load data into a BigQuery table. The library first loads the data into GCS in Avro format. The schema has an array of struct as well, and the namespace begins with a ".".

What versions of avro and parquet formats does Spark support?

Does Spark 2.0 support Avro and Parquet files? What versions?
I have downloaded spark-avro_2.10-0.1.jar and got this error during load:
Name: java.lang.IncompatibleClassChangeError
Message: org.apache.spark.sql.sources.TableScan
StackTrace: at java.lang.ClassLoader.defineClassImpl(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:349)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:154)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:727)
at java.net.URLClassLoader.access$400(URLClassLoader.java:95)
at java.net.URLClassLoader$ClassFinder.run(URLClassLoader.java:1182)
at java.security.AccessController.doPrivileged(AccessController.java:686)
at java.net.URLClassLoader.findClass(URLClassLoader.java:602)
You are just using the wrong dependency. You should use the spark-avro dependency that is compiled with Scala 2.11. You can find it here.
As for Parquet, it is supported without adding any dependency to your application.
Does Spark 2.0 support Avro and Parquet files?
The Avro format is not supported in Spark 2.x out of the box. You have to use an external package, e.g. spark-avro.
Name: java.lang.IncompatibleClassChangeError
Message: org.apache.spark.sql.sources.TableScan
The reason for java.lang.IncompatibleClassChangeError is that you used spark-avro_2.10-0.1.jar, which was compiled for Scala 2.10, while Spark 2.0 uses Scala 2.11 by default. That mismatch inevitably leads to IncompatibleClassChangeError.
You should instead load the spark-avro package using the --packages command-line option (as described in the official spark-avro documentation under With spark-shell or spark-submit):
$ bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
Using --packages ensures that the library and its dependencies are added to the classpath. The --packages argument can also be used with bin/spark-submit.
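With the package on the classpath, reading and writing Avro in Spark 2.x looks like this (paths are placeholders):
// Read Avro files through the external spark-avro package
val avroDf = spark.read.format("com.databricks.spark.avro").load("/path/to/input")

// Write a dataset back out in Avro format
avroDf.write.format("com.databricks.spark.avro").save("/path/to/output")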
Parquet is the default format when loading or saving datasets.
// loading parquet datasets (path is a placeholder)
spark.read.load("/path/to/parquet")
// saving in parquet format (path is a placeholder)
mydataset.write.save("/path/to/output")
You may want to read up on Parquet Files support in the official documentation:
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.
Parquet 1.8.2 is used (as you can see in Spark's pom.xml)

Convert Xml to Avro from Kafka to hdfs via spark streaming or flume

I want to convert XML files to Avro. The data will be in XML format and will hit the Kafka topic first. Then I can use either Flume or Spark Streaming to ingest it, convert it from XML to Avro, and land the files in HDFS. I have a Cloudera environment.
When the Avro files hit HDFS, I want the ability to read them into Hive tables later.
I was wondering what the best method for this is. I have tried automated schema conversion with spark-avro (without Spark Streaming), but the problem is that while spark-avro converts the data, Hive cannot read it. Spark-avro converts the XML to a DataFrame and then the DataFrame to Avro, but the resulting Avro files can only be read by my Spark application. I am not sure if I am using it correctly.
I think I will need to define an explicit Avro schema. I'm not sure how to go about this for the XML file; it has multiple namespaces and is quite massive.
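A minimal sketch of that batch conversion path, assuming the spark-xml and spark-avro packages are on the classpath; rowTag and the paths are placeholders:
// Parse the XML into a DataFrame with spark-xml, letting it infer the schema
val xmlDf = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("hdfs:///landing/xml/")

// Write the DataFrame out as Avro so a Hive external table can point at it
xmlDf.write
  .format("com.databricks.spark.avro")
  .save("hdfs:///warehouse/avro/")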
If you are on Cloudera (since you have Flume, you may well have it), you can use Morphlines to do the conversion at the record level. It works for both batch and streaming. See here for more info.

How to merge schema while loading avro in spark dataframe?

I am trying to read Avro files using https://github.com/databricks/spark-avro, and the Avro schema has evolved over time. I read them like this, with the mergeSchema option set to true, hoping it would merge the schemas itself, but it didn't work:
sqlContext.read.format("com.databricks.spark.avro").option("mergeSchema", "true").load('s3://xxxx/d=2015-10-27/h=*/')
What is the workaround?
Schema merging is not implemented for Avro files in Spark, and there is no easy workaround. One option is to read your Avro data file-by-file (or partition-by-partition) as separate datasets and then union those datasets, but that can be terribly slow.
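A minimal sketch of that partition-by-partition approach; the paths are placeholders, and it assumes a newer Spark where the built-in avro source and unionByName with allowMissingColumns (Spark 3.1+) are available. On older versions you would have to align the columns yourself before a plain union:
// Read each partition with its own schema, then union the results by column name
val partitionPaths = Seq(
  "s3://xxxx/d=2015-10-26/",
  "s3://xxxx/d=2015-10-27/"
)
val merged = partitionPaths
  .map(path => spark.read.format("avro").load(path))
  .reduce((left, right) => left.unionByName(right, allowMissingColumns = true))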
