Loading Avro into BigQuery via the Spark Connector - apache-spark

From what I've seen in these examples, it's only doable via Gson. Is it possible to directly load Avro objects into a BigQuery table via the Spark Connector? Converting from Avro to BigQuery JSON becomes a pain when the Avro specification starts going beyond simple primitive values (e.g. unions).
Cheers

Not through the Spark Connector, but BigQuery supports loading Avro files directly: https://cloud.google.com/bigquery/loading-data#loading_avro_files
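For what it's worth, the usual workaround looks roughly like the sketch below (the bucket, dataset, and table names are placeholders, and it assumes the spark-avro package): write the Avro files to GCS from Spark, then run a native BigQuery load job on them.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-to-bigquery").getOrCreate()

// Read whatever Avro data the pipeline produced (path is a placeholder).
val df = spark.read.format("com.databricks.spark.avro").load("hdfs:///input/events")

// Stage the data as Avro on GCS; unions and other complex Avro types are
// kept as-is, so no manual conversion to BigQuery JSON is needed.
df.write
  .format("com.databricks.spark.avro")
  .save("gs://my-staging-bucket/events-avro/")

// Then, outside Spark, load the staged files with BigQuery's native Avro support, e.g.:
//   bq load --source_format=AVRO mydataset.events "gs://my-staging-bucket/events-avro/*.avro"
```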

Related

Spark structured streaming from JDBC source

Can someone let me know if it's possible to do Spark structured streaming from a JDBC source, e.g. a SQL DB or any RDBMS?
I have looked at a few similar questions on SO, e.g.:
Spark streaming jdbc read the stream as and when data comes - Data source jdbc does not support streamed reading
jdbc source and spark structured streaming
However, I would like to know if it's officially supported in Apache Spark.
If there is any sample code that would be helpful.
Thanks
No, there is no such built-in support in Spark Structured Streaming. The main reason is that most databases don't provide a unified interface for obtaining the changes.
It's possible to get changes from some databases using archive logs, write-ahead logs, etc., but it's database-specific. For many databases the popular choice is Debezium, which can read such logs and push the list of changes into Kafka (or something similar), from which it can be consumed by Spark.
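Roughly, the Spark side of that pipeline is just a plain Kafka source in Structured Streaming; something like the sketch below, where the broker, topic, and change-event schema are all placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("cdc-from-kafka").getOrCreate()

// Assumed shape of the change events the CDC tool pushes to Kafka.
val changeSchema = new StructType()
  .add("op", StringType)      // e.g. c = insert, u = update, d = delete
  .add("id", LongType)
  .add("payload", StringType)

val changes = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")        // placeholder broker
  .option("subscribe", "dbserver1.inventory.customers")    // placeholder topic
  .load()
  .select(from_json(col("value").cast("string"), changeSchema).as("c"))
  .select("c.*")
```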
I am on a project now architecting this using SharePlex CDC from Oracle, writing to Kafka, and then using Spark Structured Streaming with Kafka integration and MERGE on Delta format on HDFS.
That is the way to do it if not using Debezium. You can use change logs for base tables or materialized views to feed CDC.
So direct JDBC is not possible.
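For the MERGE-on-Delta part mentioned above, the common pattern (sketched here with the open-source Delta Lake API, assumed paths, and an assumed id key column) is a foreachBatch sink that merges each micro-batch into the target table:

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Assumes the `changes` stream from the previous sketch and an existing Delta table on HDFS.
val target = DeltaTable.forPath(spark, "hdfs:///warehouse/customers_delta")

changes.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    target.as("t")
      .merge(batch.as("s"), "t.id = s.id")      // `id` is an assumed key column
      .whenMatched("s.op = 'd'").delete()
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .execute()
  }
  .option("checkpointLocation", "hdfs:///checkpoints/customers_cdc")
  .start()
```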

Spark structured streaming with Apache Hudi

I have a requirement where I need to write the stream to a Hudi dataset using structured streaming. I found there is a provision to do this in the Apache Hudi Jira issues, but I wanted to know if anyone has successfully implemented this and has an example. I am trying to stream the data from AWS Kinesis Firehose to Apache Hudi using Spark Structured Streaming.
Quick help is appreciated.
I know of at least one user using the structured streaming sink in Hudi. https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/test/scala/DataSourceTest.scala#L190 could help?
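As a rough sketch (option keys taken from the Hudi datasource docs; the table name, field names, and paths are placeholders), a structured streaming write to Hudi looks something like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-to-hudi").getOrCreate()

// Placeholder stream; in practice this would be whatever DataFrame you build
// from the Kinesis Firehose data (e.g. via a Kafka bridge or a Kinesis connector).
val events = spark.readStream.format("rate").load()
  .selectExpr("value as event_id", "timestamp as ts", "date_format(timestamp, 'yyyy-MM-dd') as dt")

events.writeStream
  .format("org.apache.hudi")
  .option("hoodie.table.name", "events_hudi")                     // assumed table name
  .option("hoodie.datasource.write.recordkey.field", "event_id")  // assumed record key
  .option("hoodie.datasource.write.precombine.field", "ts")       // assumed ordering field
  .option("hoodie.datasource.write.partitionpath.field", "dt")    // assumed partition field
  .option("checkpointLocation", "/tmp/hudi-checkpoints/events")
  .outputMode("append")
  .start("/data/hudi/events")
```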

Spark structured streaming over google cloud storage

I am running a few batch Spark pipelines that consume Avro data on Google Cloud Storage. I need to update some pipelines to be more real-time, and I am wondering if Spark Structured Streaming can directly consume files from GCS in a streaming way, i.e. whether sparkSession.readStream.from(...) can be applied to Avro files that are being generated continuously under a bucket by external sources.
Apache Beam already has something like FileIO.matchAll().continuously(), Watch, and watchForNewFiles, which allow Beam pipelines to monitor for new files and read them in a streaming way (thus obviating the need for Pub/Sub or a notification system). Is there something similar for Spark Structured Streaming as well?
As the GCS connector exposes a Hadoop-Compatible FileSystem (HCFS), "gs://" URIs should be valid targets for SparkSession.readStream.from.
Avro file handling is implemented by spark-avro. Using it with readStream should be accomplished the same way as generic reading (e.g., .format("com.databricks.spark.avro"))
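Put together, a sketch of the streaming read might look like the following; note that streaming file sources require an explicit schema, and the schema, bucket, and checkpoint paths here are placeholders (on Spark 2.4+ the built-in format name is simply "avro").

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("gcs-avro-stream").getOrCreate()

// Streaming file sources need an explicit schema up front.
val schema = new StructType()
  .add("id", LongType)
  .add("payload", StringType)

// "gs://" paths work as long as the GCS Hadoop connector is on the classpath.
val avroStream = spark.readStream
  .schema(schema)
  .format("com.databricks.spark.avro")
  .load("gs://my-bucket/incoming-avro/")

avroStream.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/gcs-avro-checkpoint")
  .start()
```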

Difference Between Spark SQL and Hive

Can you please help me understand the difference between Spark SQL and Hive?
The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.
Built on top of Apache Hadoop, Hive provides the following features:
Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
Sub-second query retrieval via Hive LLAP, Apache YARN, and Apache Slider.
A mechanism to impose structure on a variety of data formats.
Whereas Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing.
Spark SQL is a Spark module for structured data processing, with in-memory processing at its core. Using Spark SQL, you can read data from any structured source, such as JSON, CSV, Parquet, Avro, SequenceFiles, JDBC, Hive, etc.
Spark SQL can also be used to read data from an existing Hive installation. Thus, Spark SQL is the generalized module which can be used to process any structured data source.
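To make the difference concrete, here is a small sketch (table names and paths are placeholders): the same SparkSession can read files directly and, with Hive support enabled, query tables in an existing Hive metastore.

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() lets Spark SQL read tables from an existing Hive metastore.
val spark = SparkSession.builder()
  .appName("spark-sql-sources")
  .enableHiveSupport()
  .getOrCreate()

// The same DataFrame API reads many structured sources.
val fromJson    = spark.read.json("/data/events.json")
val fromParquet = spark.read.parquet("/data/events.parquet")
val fromHive    = spark.sql("SELECT * FROM sales.orders")   // assumed Hive table

fromJson.createOrReplaceTempView("events")
spark.sql("SELECT count(*) FROM events").show()
```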

Data types supported in Storm and Spark

I am new to Storm and Spark. I would just like to ask how these two frameworks store files. Can they use HDFS? Also, do they support the XML format?
Thanks,
I am not sure about Spark, but Storm can write to HDFS:
https://github.com/apache/storm/tree/master/external/storm-hdfs
Apart from just HDFS, it allows you to write to HBase and Redis, and it gives you a JDBC connector module as well. You can see the list here:
https://github.com/apache/storm/tree/master/external
As far as Spark is concerned, it can read and write to HDFS, and it supports the XML format just as Hadoop does.
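For the Spark side, a small sketch (the paths and rowTag are placeholders, and XML support comes from the separate spark-xml package rather than Spark itself):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hdfs-and-xml").getOrCreate()

// Reading and writing HDFS paths works out of the box.
val df = spark.read.parquet("hdfs:///data/input")
df.write.parquet("hdfs:///data/output")

// XML needs the spark-xml package (com.databricks:spark-xml) on the classpath;
// "book" is a placeholder for the repeating element in the file.
val xml = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("hdfs:///data/books.xml")
xml.printSchema()
```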
