Spark structured streaming with Apache Hudi

I have a requirement where I need to write a stream to a Hudi dataset using Structured Streaming. I found that there is a provision to do this in the Apache Hudi JIRA issues, but I wanted to know if anyone has successfully implemented this and has an example. I am trying to stream data from AWS Kinesis Firehose to Apache Hudi using Spark Structured Streaming.
Quick help is appreciated.

I know of at least one user using the structured streaming sink in Hudi. https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/test/scala/DataSourceTest.scala#L190 could help?
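For reference, here is a minimal sketch of what such a sink can look like, assuming a file-based source stream; the S3 paths, table name, and the id/ts/date fields are placeholders for your own layout and schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("firehose-to-hudi").getOrCreate()

// Placeholder schema for the incoming JSON records landed by Firehose.
val schema = new StructType()
  .add("id", StringType)
  .add("ts", LongType)
  .add("date", StringType)

// Read the landed files as a stream.
val input = spark.readStream
  .schema(schema)
  .json("s3://my-bucket/firehose-landing/")

// Upsert each micro-batch into a Hudi table. The checkpoint location is what
// lets the query resume where it left off after a restart.
input.writeStream
  .format("org.apache.hudi") // the short alias "hudi" works in newer releases
  .option("hoodie.table.name", "my_hudi_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.partitionpath.field", "date")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("checkpointLocation", "s3://my-bucket/checkpoints/my_hudi_table")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .start("s3://my-bucket/hudi/my_hudi_table")
```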

Related

Spark structured streaming from JDBC source

Can someone let me know if it's possible to do Spark Structured Streaming from a JDBC source, e.g. a SQL DB or any RDBMS?
I have looked at a few similar questions on SO, e.g.:
Spark streaming jdbc read the stream as and when data comes - Data source jdbc does not support streamed reading
jdbc source and spark structured streaming
However, I would like to know if it is officially supported in Apache Spark?
If there is any sample code that would be helpful.
Thanks
No, there is no such built-in support in Spark Structured Streaming. The main reason is that most databases don't provide a unified interface for obtaining the changes.
It's possible to get changes from some databases using archive logs, write-ahead logs, etc., but it's database-specific. For many databases the popular choice is Debezium, which can read such logs and push the list of changes into Kafka (or something similar), from which they can be consumed by Spark.
I am on a project now architecting this using SharePlex CDC from Oracle, writing to Kafka, and then using Spark Structured Streaming with the Kafka integration and MERGE on Delta format on HDFS.
I.e., that is the way to do it if not using Debezium. You can use change logs for base tables or materialized views to feed the CDC.
So direct JDBC streaming is not possible.
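In both setups the Spark side is just a Kafka consumer. Here is a minimal sketch, assuming Debezium-style JSON change events; the broker address, topic name, and row schema are placeholders (real Debezium envelopes carry more fields, e.g. a payload wrapper when schemas are enabled):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("jdbc-cdc-via-kafka").getOrCreate()
import spark.implicits._

// Placeholder schema for one row of the replicated table.
val rowSchema = new StructType()
  .add("id", LongType)
  .add("name", StringType)

// Debezium wraps each change in an envelope with before/after row images and
// an "op" code (c = create, u = update, d = delete).
val envelope = new StructType()
  .add("before", rowSchema)
  .add("after", rowSchema)
  .add("op", StringType)

val changes = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "dbserver1.inventory.customers")
  .option("startingOffsets", "earliest")
  .load()
  .select(from_json($"value".cast("string"), envelope).as("change"))
  .selectExpr("change.op", "change.after.*")

changes.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/cdc")
  .start()
  .awaitTermination()
```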

GCP: Spark Structured Streaming + Custom Pub/Sub Source

There is currently no support for a PubSub source in Spark Structured Streaming.
Has anyone written a custom source in Spark Structured Streaming to read from PubSub?
Is this a possible approach?

How to write spark structured streaming data to REST API?

I would like to push my Spark Structured Streaming processed data to a REST API. Can someone share examples of the same? I have found a few, but they are all related to Spark Streaming, not Structured Streaming.
I have not heard of a REST API sink for Spark Structured Streaming, but you could write one yourself. Start from org.apache.spark.sql.execution.streaming.Sink.
The easiest approach, however, would be to use DataStreamWriter.foreach or foreachBatch (available since Spark 2.4).
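Here is a minimal foreachBatch sketch (Spark 2.4+); the endpoint URL is a placeholder, and error handling and retries are omitted:

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

import org.apache.spark.sql.DataFrame

// POST every record of a micro-batch to the (placeholder) endpoint. toJSON
// keeps this schema-agnostic.
def postBatch(batch: DataFrame, batchId: Long): Unit = {
  batch.toJSON.foreachPartition { rows: Iterator[String] =>
    rows.foreach { json =>
      val conn = new URL("https://example.com/api/events") // placeholder URL
        .openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setRequestProperty("Content-Type", "application/json")
      conn.setDoOutput(true)
      conn.getOutputStream.write(json.getBytes(StandardCharsets.UTF_8))
      conn.getResponseCode // force the request; check this in real code
      conn.disconnect()
    }
  }
}

// Wire it into a streaming query:
// streamingDF.writeStream.foreachBatch(postBatch _).start()
```

Running the HTTP calls inside foreachPartition keeps them on the executors instead of collecting the whole batch to the driver.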

Is there a way to load streaming data from Kafka into HDFS using Spark and without Flume?

I was looking for a way to load streaming data from Kafka directly into HDFS using Spark Streaming, without using Flume.
I have already tried it using Flume (Kafka source and HDFS sink).
Thanks in Advance!
There is an HDFS connector for Kafka Connect; Confluent's documentation has more information.
This is a pretty basic task for Spark Streaming. Depending on which versions of Spark and Kafka you are using, you can look at the Spark Streaming + Kafka integration documentation for those versions. Saving to HDFS is as easy as rdd.saveAsTextFile("hdfs:///directory/filename"); see the sketch after the guide link below.
Spark/Kafka integration guide for latest versions
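Here is a minimal sketch of that approach using the spark-streaming-kafka-0-10 integration; the broker address, topic, group id, and HDFS path are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val conf = new SparkConf().setAppName("kafka-to-hdfs")
val ssc = new StreamingContext(conf, Seconds(30))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092", // placeholder broker
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "kafka-to-hdfs",
  "auto.offset.reset" -> "latest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

// Each non-empty micro-batch lands as a timestamped directory of text files.
stream.map(_.value).foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.saveAsTextFile(s"hdfs:///data/my-topic/${System.currentTimeMillis}")
  }
}

ssc.start()
ssc.awaitTermination()
```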

How to send data from kafka to spark

I want to send my data from Kafka to Spark.
I have installed Spark on my system, and Kafka is also working properly.
You need to use a Kafka connector from Spark. Technically, Kafka won't send the data to Spark; rather, Spark pulls the data from Kafka.
Here is the link to the documentation: https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
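Here is a minimal sketch of Spark pulling from Kafka, using the Structured Streaming Kafka source (spark-sql-kafka-0-10) rather than the DStream API from the linked guide; the broker address and topic name are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-to-spark").getOrCreate()

// Spark subscribes to the (placeholder) topic and pulls records itself.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "my-topic")
  .load()

// Kafka delivers keys and values as binary; cast them to strings to inspect.
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()
  .awaitTermination()
```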
