How to load a BigQuery table using Delta Lake - apache-spark

I have the below requirement:
I am processing source data using Spark and then writing it into a BigQuery table.
Since Spark doesn't have an update/merge operation, I'm going to use Delta Lake along with Spark.
I can write the source data to a Delta Lake path (in a GCS bucket). After this, how can I sync it with the BigQuery table?
Can we create an external BigQuery table on this Delta Lake path?
Or should I read the data from this Delta Lake path into a DataFrame and then use df.write to load the BigQuery table? Which is the best approach?
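As a concrete illustration of the second option described above (read the Delta path back with Spark and df.write it to BigQuery via the spark-bigquery connector), here is a minimal sketch. The bucket, dataset and table names are placeholders, and the delta-core and spark-bigquery connector jars are assumed to be available on the cluster.

# Keep the Delta table on GCS as the system of record, then read its
# current state and overwrite (or append to) the BigQuery table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("delta-to-bigquery")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

delta_path = "gs://my-bucket/delta/events"                     # placeholder GCS path

# 1) Write (or merge) the processed source data into the Delta table.
source_df = spark.read.parquet("gs://my-bucket/raw/events")    # placeholder source
source_df.write.format("delta").mode("append").save(delta_path)

# 2) Read the Delta table's current state and sync it to BigQuery.
current_df = spark.read.format("delta").load(delta_path)
(current_df.write.format("bigquery")
    .option("table", "my_project.my_dataset.events")           # placeholder table
    .option("temporaryGcsBucket", "my-temp-bucket")            # placeholder bucket
    .mode("overwrite")
    .save())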

Related

How to register an existing Delta table in Hive

We are using Spark for reading/writing data in Delta format stored in HDFS (Databricks Delta table version 0.5.0).
We would like to utilize the power of Hive to interact with the Delta tables.
How can we register existing data in Delta format from a path on HDFS in Hive?
Please note that we are currently running Spark 2.4.0 on the Cloudera platform (CDH 6.3.3).
The only way I can do this so far is by registering it as an unmanaged table. The most significant difference, as far as I can tell, is that if you drop an unmanaged table, it does not drop the underlying data.
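A rough sketch of registering such an unmanaged (external) table follows. Note that it uses the CREATE TABLE ... USING DELTA syntax, which needs Delta Lake 0.7.0+ on Spark 3; the metastore integration on Delta 0.5.0 / Spark 2.4 is much more limited. The database, table and HDFS path are placeholders.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("register-delta-table")
         .enableHiveSupport()
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Register the existing Delta directory as an unmanaged table in the metastore.
spark.sql("""
    CREATE TABLE IF NOT EXISTS mydb.events
    USING DELTA
    LOCATION 'hdfs:///data/delta/events'
""")

# Because the table is unmanaged, dropping it later removes only the metastore
# entry, not the files under the LOCATION.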

How to do an incremental load/upsert in spark-redshift

I have an ETL pipeline where data comes from Redshift, is read into (py)spark DataFrames, calculations are performed, and the result is dumped back to a target in Redshift. So the flow is: Redshift source schema --> Spark 3.0 --> Redshift target schema. This is done on EMR using the spark-redshift library provided by Databricks. But my data has millions of records, and doing a full load every time is not a good option.
How can I perform incremental loads/upserts with the spark-redshift library? The option I wanted to go with is Delta Lake (open source and guarantees ACID), but we cannot simply read and write Delta files to Redshift Spectrum using the Delta Lake integration.
Please guide me on how I can achieve this, and whether there are any alternatives.
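One common workaround, sketched below under some assumptions, is to write only the incremental rows to a staging table and use the connector's postactions option to do a delete-and-insert upsert inside Redshift. The format string depends on which fork of the connector is in use ("com.databricks.spark.redshift" for the original Databricks library); the table names, key column and JDBC/S3 settings below are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-incremental-upsert").getOrCreate()

jdbc_url = "jdbc:redshift://my-cluster:5439/mydb?user=...&password=..."  # placeholder

# Read only the rows that changed since the last run (placeholder watermark).
incremental_df = (spark.read
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", jdbc_url)
    .option("query", "SELECT * FROM source_schema.events WHERE updated_at > '2023-01-01'")
    .option("tempdir", "s3a://my-bucket/tmp/")
    .option("forward_spark_s3_credentials", "true")
    .load())

result_df = incremental_df  # apply your existing calculations here

# Redshift SQL run after the staging table is loaded: delete the rows being
# replaced, insert the new versions, then drop the staging table.
upsert_sql = """
    DELETE FROM target_schema.events
    USING target_schema.events_staging s
    WHERE target_schema.events.id = s.id;
    INSERT INTO target_schema.events SELECT * FROM target_schema.events_staging;
    DROP TABLE target_schema.events_staging;
"""

(result_df.write
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", jdbc_url)
    .option("dbtable", "target_schema.events_staging")
    .option("tempdir", "s3a://my-bucket/tmp/")
    .option("forward_spark_s3_credentials", "true")
    .option("postactions", upsert_sql)
    .mode("overwrite")
    .save())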

Can Hive read data from the Delta Lake file format?

I started going through the Delta Lake file format. Is Hive capable of reading data from this newly introduced Delta file format? If so, could you please let me know the SerDe you are using?
Hive support is available for the Delta Lake file format. The first step is to add the jars from https://github.com/delta-io/connectors to your Hive path, and then create a table using the following format.
CREATE EXTERNAL TABLE test.dl_attempts_stream
(
...
)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION '<path-to-delta-table>'
The Delta format picks up partitions by default, so there is no need to specify partitions while creating the table.
NOTE: If data is being inserted via a Spark job, please provide hive-site.xml and use enableHiveSupport in the Spark job, so that the Delta Lake table is created in Hive.
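A short sketch of the Spark side mentioned in the note, assuming hive-site.xml is on the driver classpath and the delta-core jar is available; the path and columns are placeholders.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("write-delta-for-hive")
         .enableHiveSupport()   # picks up hive-site.xml / the shared metastore
         .getOrCreate())

# Write to the Delta path that the Hive connector table above points at.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("append").save("/data/delta/dl_attempts_stream")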

SparkSQL on a Hive partitioned external table on Amazon S3

I am planning to use SparkSQL (not PySpark) on top of data in Amazon S3. So I believe I need to create a Hive external table and can then use SparkSQL. But the S3 data is partitioned, and I want the partitions reflected in the Hive external table as well.
What is the best way to manage the Hive table on a daily basis, given that every day new partitions can be created or old partitions can be overwritten? What should be done to keep the Hive external table up-to-date?
Create an intermediate table and load into your Hive table with INSERT OVERWRITE on the date partition.
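A rough sketch of that pattern, run through spark.sql with Hive support; the database, table, column, date and S3 path names are placeholders, and the intermediate (staging) table is assumed to be loaded by the daily job.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("daily-partition-refresh")
         .enableHiveSupport()
         .getOrCreate())

# Partitioned external table over the S3 data.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
        id BIGINT,
        payload STRING
    )
    PARTITIONED BY (dt STRING)
    STORED AS PARQUET
    LOCATION 's3a://my-bucket/events/'
""")

# Overwrite (or create) just one day's partition from the staged intermediate table.
spark.sql("""
    INSERT OVERWRITE TABLE analytics.events PARTITION (dt = '2023-01-01')
    SELECT id, payload FROM analytics.events_staging
""")

# If partition directories are ever written to S3 outside of Hive/Spark,
# MSCK REPAIR TABLE registers them in the metastore.
spark.sql("MSCK REPAIR TABLE analytics.events")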

Which component is better to move data from HDFS into Hive with some data transformation?

I need to load some data from HDFS into Hive, but I need to do some aggregations across the files I have in HDFS. I read that Sqoop can do that, but only using MySQL. What other choices do I have to do this?
Thanks!
Your best option would be to create an external table in Hive that sources from your files in HDFS. Then you can create a Hive table to store your aggregated data and write some Hive SQL to do the insert into that table.
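A rough sketch of that flow, run through spark.sql with Hive support (the same statements can be executed directly in the Hive CLI or beeline); the database, table, column and path names are placeholders, and the raw files are assumed to be comma-delimited text.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hdfs-to-hive-aggregation")
         .enableHiveSupport()
         .getOrCreate())

# External table over the raw files already sitting in HDFS.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.raw_events (
        user_id BIGINT,
        amount  DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION 'hdfs:///data/raw_events/'
""")

# Table that holds the aggregated result.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.events_agg (
        user_id      BIGINT,
        total_amount DOUBLE
    )
    STORED AS PARQUET
""")

# The aggregation between the two.
spark.sql("""
    INSERT OVERWRITE TABLE analytics.events_agg
    SELECT user_id, SUM(amount) AS total_amount
    FROM staging.raw_events
    GROUP BY user_id
""")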
