Spark AVRO compatible with BigQuery - apache-spark

I'm trying to create an external table in Hive and another in BigQuery over the same data, stored in Google Cloud Storage in Avro format and written with Spark.
I'm using a Dataproc cluster with Spark 2.2.0, Spark-Avro 4.0.0, and Hive 2.1.1.
There are some differences between the Avro versions/packages, but if I create the table using Hive and then write the files using Spark, I'm able to see them in Hive.
BigQuery is different, though: it is able to read the Hive Avro files but NOT the Spark Avro files.
Error:
The Apache Avro library failed to parse the header with the follwing error: Invalid namespace: .someField
From a little searching about the error, it seems the problem is that the Avro files Spark writes are different from the Avro files Hive/BigQuery expect.
I don't know exactly how to fix this; maybe by using a different Avro package in Spark, but I haven't found one that is compatible with all the systems.
I would also like to avoid tricky workarounds like creating a temporary table in Hive and then creating another one with insert into ... select * from ...; I'll be writing a lot of data and I would like to avoid that kind of solution.
Any help would be appreciated. Thanks

The error message is thrown by the C++ Avro library, which BigQuery uses. Hive probably uses the Java Avro library. The C++ library doesn't allow a namespace to start with ".".
This is the code from the library:
if (!ns_.empty() && (ns_[0] == '.' || ns_[ns_.size() - 1] == '.' ||
                     std::find_if(ns_.begin(), ns_.end(), invalidChar1) != ns_.end())) {
    throw Exception("Invalid namespace: " + ns_);
}

Spark-avro has an additional option, recordNamespace, to set the root namespace, so that it will not start with ".".
https://github.com/databricks/spark-avro/blob/branch-4.0/README-for-old-spark-versions.md
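A minimal write sketch using that option (assumes the databricks spark-avro 4.0.0 package and an existing SparkSession; the root namespace "com.example", the record name, and the paths are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-namespace-example").getOrCreate()

// Any DataFrame source; the input path is a placeholder.
val df = spark.read.parquet("gs://my-bucket/input/")

df.write
  .format("com.databricks.spark.avro")
  .option("recordName", "MyRecord")          // top-level record name
  .option("recordNamespace", "com.example")  // root namespace, so nested record namespaces don't start with "."
  .save("gs://my-bucket/avro-output/")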

Wondering if you ever found an answer to this.
I am seeing the same thing while trying to load data into a BigQuery table. The library first loads the data into GCS in Avro format. The schema has an array of structs as well, and the namespace begins with a ".".

Related

Unsupported encoding: DELTA_BYTE_ARRAY when reading from Kusto using Kusto Spark connector or using Kusto export with Spark version < 3.3.0

Since last week we started getting java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY while reading from Kusto using the Kusto Spark connector in 'Distributed' mode (the same thing happens when using the export command and running a Parquet read over the output). How can we resolve this issue? Is it caused by a change in the Kusto service or in Spark?
We tried setting the configs "spark.sql.parquet.enableVectorizedReader=false", and "parquet.split.files=false". This works but we are worried about the outcome of this approach.
The change of behavior is due to Kusto rolling out a new implementation of the Parquet writer that uses new encoding schemes, one of which is delta byte array for strings and other byte array-based Parquet types. This encoding scheme has been part of the Parquet format for a few years now, and modern readers, e.g. Spark 3.3.0, are expected to support it. It provides performance and cost improvements, so we highly advise customers to move to Spark 3.3.0 or above. The Kusto Spark connector uses Kusto export for reading and thereby produces Parquet files written with the new writer.
Possible solutions in case this is not an option:
1. Use Kusto Spark connector version 3.1.10, which checks the Spark version and disables the new writer in the export command if the version is less than 3.3.0.
2. Disable the Spark configs (see the sketch after this answer):
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.conf.set("parquet.split.files", "false")
3. If none of the above solves the issue, you may open a support ticket to ADX to disable the feature (this is a temporary solution).
Note: Synapse workspaces will receive the updated connector version in the following days.
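A sketch of the config workaround from option 2 (assumes an existing SparkSession older than 3.3.0; the export path is a placeholder):

// Fall back to the non-vectorized Parquet reader and avoid splitting the exported files.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.conf.set("parquet.split.files", "false")

// Read the exported Parquet as usual afterwards.
val exported = spark.read.parquet("abfss://container@account.dfs.core.windows.net/kusto-export/")
exported.show(5, false)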

Parquet column issue while loading data to SQL Server using spark-submit

I am facing the following issue while migrating data from Hive to SQL Server using a Spark job, with the query supplied through a JSON file.
Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file.
Column: [abc], Expected: string, Found: INT32
From what I understand, the Parquet file contains a different column structure than the Hive view. I am able to retrieve the data using tools like Teradata, etc.; it is only when loading to the other server that the issue occurs.
Can anyone help me understand the problem and suggest a workaround (see the sketch after the version details below)?
Edit:
Spark version 2.4.4.2
Scala version 2.11.12
Hive 2.3.6
SQL Server 2016
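One possible workaround sketch (the paths, column name, and JDBC details are placeholders): read the Parquet files directly with their on-disk types, cast the mismatched column to the type the target expects, and write to SQL Server over JDBC.

import org.apache.spark.sql.functions.col

// Read the Parquet files directly, bypassing the Hive view's declared schema.
val df = spark.read.parquet("/warehouse/path/to/table")

// The column is INT32 on disk but string is expected downstream, so cast it explicitly.
val fixed = df.withColumn("abc", col("abc").cast("string"))

// Requires the mssql-jdbc driver on the classpath.
fixed.write
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=mydb")
  .option("dbtable", "dbo.target_table")
  .option("user", "username")
  .option("password", "password")
  .mode("append")
  .save()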

Parquet column cannot be converted: Expected decimal, Found binary

I'm using Apache NiFi 1.9.2 to load data from a relational database into Google Cloud Storage. The goal is to write the output into Parquet files, since Parquet stores data in a columnar way. To achieve this I use the ConvertAvroToParquet processor (default settings) in NiFi, followed by the PutGCSObject processor. The problem with the resulting files is that I cannot read Decimal-typed columns when consuming the files in Spark 2.4.0 (Scala 2.11.12): Parquet column cannot be converted ... Column: [ARHG3A], Expected: decimal(2,0), Found: BINARY
Links to parquet/avro example files:
https://drive.google.com/file/d/1PmaP1qanIZjKTAOnNehw3XKD6-JuDiwC/view?usp=sharing
https://drive.google.com/file/d/138BEZROzHKwmSo_Y-SNPMLNp0rj9ci7q/view?usp=sharing
Since NiFi works with the Avro format between processors within the flowfile, I have also written out the Avro file (as it is just before the ConvertAvroToParquet processor), and that file I can read in Spark.
It is also possible to not use logical types in Avro, but then I lose the column types in the end and all columns are Strings (not preferred).
I have also experimented with the PutParquet processor without success.
val arhg_parquet = spark.read.format("parquet").load("ARHG.parquet")
arhg_parquet.printSchema()
arhg_parquet.show(10,false)
printSchema() gives proper result, indicating ARHG3A is a decimal(2,0)
Executing the show(10,false) results in an ERROR: Parquet column cannot be converted in file file:///C:/ARHG.parquet. Column: [ARHG3A], Expected: decimal(2,0), Found: BINARY
To achieve this I make use of the ConvertAvroToParquet (default settings) processor in Nifi (followed by the PutGCSObject processor)
Try upgrading to NiFi 1.12.1, our latest release. Some improvements were made to decimal handling that might be applicable here. Also, as of ~1.10.0 you can use the Parquet reader and writer services to convert from Avro to Parquet. If that doesn't work, it may be a bug that should have a Jira ticket filed against it.
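As an interim sketch (paths are placeholders; assumes an existing SparkSession and the Spark 2.4 external Avro source from the org.apache.spark:spark-avro_2.11:2.4.0 package), the intermediate Avro file that Spark can already read could be rewritten to Parquet by Spark itself:

// Read the NiFi intermediate Avro, which Spark handles correctly,
// and let Spark write the Parquet instead of ConvertAvroToParquet.
val avroDf = spark.read.format("avro").load("gs://my-bucket/intermediate/ARHG.avro")
avroDf.write.mode("overwrite").parquet("gs://my-bucket/parquet/ARHG")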

Does presto require a hive metastore to read parquet files from S3?

I am trying to generate Parquet files in S3 using Spark, with the goal that Presto can later be used to query the Parquet. Basically, this is how it looks:
Kafka-->Spark-->Parquet<--Presto
I am able to generate Parquet in S3 using Spark and it's working fine. Now I am looking at Presto, and what I think I've found is that it needs the Hive metastore to query Parquet. I could not make Presto read my Parquet files even though Parquet saves the schema. So does that mean that at the time of creating the Parquet files, the Spark job also has to store metadata in the Hive metastore?
If that is the case, can someone help me find an example of how it's done? To add to the problem, my data schema is changing, so to handle that I create a programmatic schema in the Spark job and apply it while creating the Parquet files. If I am creating the schema in the Hive metastore, it needs to be done with this in mind.
Or could you shed some light on whether there is a better alternative?
You keep the Parquet files on S3. Presto's S3 capability is a subcomponent of the Hive connector. As you said, you can let Spark define the tables, or you can use Presto for that, e.g.
create table hive.default.xxx (<columns>)
with (format = 'parquet', external_location = 's3://s3-bucket/path/to/table/dir');
(Depending on Hive metastore version and its configuration, you might need to use s3a instead of s3.)
Technically, it should be possible to create a connector that infers tables' schemata from Parquet headers, but I'm not aware of an existing one.
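On the Spark side, a minimal sketch of defining such a table (the database, table, columns, and bucket path are placeholders; assumes Hive support is enabled in the Spark session):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("register-parquet-table")
  .enableHiveSupport()   // required so the table definition lands in the Hive metastore
  .getOrCreate()

// Register the existing S3 Parquet directory as an external Hive table
// that Presto's Hive connector can then query.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS default.events (
    id BIGINT,
    payload STRING
  )
  STORED AS PARQUET
  LOCATION 's3a://s3-bucket/path/to/table/dir'
""")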

Convert Xml to Avro from Kafka to hdfs via spark streaming or flume

I want to convert XML files to Avro. The data will be in XML format and will hit the Kafka topic first. Then I can use either Flume or Spark Streaming to ingest it, convert it from XML to Avro, and land the files in HDFS. I have a Cloudera environment.
When the Avro files land in HDFS, I want the ability to read them into Hive tables later.
I was wondering what the best method to do this is. I have tried automated schema conversion such as spark-avro (this was without Spark Streaming), but the problem is that while spark-avro converts the data, Hive cannot read it. spark-avro converts the XML to a DataFrame and then from the DataFrame to Avro, but the resulting Avro file can only be read by my Spark application. I am not sure if I am using it correctly.
I think I will need to define an explicit Avro schema, but I'm not sure how to go about this for the XML file: it has multiple namespaces and is quite massive.
If you are on Cloudera (since you have Flume, you probably have it), you can use Morphlines to do the conversion at the record level, in either batch or streaming. You can see here for more info.
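For comparison, a minimal sketch of the spark-avro route the question describes, assuming the Databricks spark-xml package for parsing and an existing SparkSession (the rowTag, record names, and paths are placeholders):

// Parse the XML into a DataFrame with spark-xml.
val xmlDf = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("hdfs:///landing/xml/")

// Write Avro with an explicit record name and root namespace,
// which also helps Hive and other readers parse the files.
xmlDf.write
  .format("com.databricks.spark.avro")
  .option("recordName", "MyRecord")
  .option("recordNamespace", "com.example")
  .save("hdfs:///warehouse/avro/")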
