How to insert data from Kafka to Kudu using Spark streaming - apache-spark

I have a Spark Streaming application that listens to a Kafka topic.
When the data arrives I need to process it and send it to Kudu.
Currently I am using the org.apache.kudu.spark.kudu.KuduContext API and calling its insert action with the data frame.
To create the data frame from my data I call collect() so that I can build the data frame using sqlContext.
Is there a way to create the data frame and insert the data into Kudu without calling collect(), which is of course costly?
We are using Spark 1.6.

The Kudu sink for Spark now supports Structured Streaming: https://issues.apache.org/jira/browse/KUDU-2640
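For the Spark 1.6 DStream case in the question, a minimal sketch of avoiding collect() is to build the data frame directly from each micro-batch RDD inside foreachRDD and append it through the kudu-spark data source. This assumes the kudu-spark jar is on the classpath and that the data source name and the kudu.master / kudu.table option keys match your kudu-spark version; the master address, table name, schema, and the parse function in the usage note are hypothetical.

```python
# Data source name documented for kudu-spark; verify it against your version.
KUDU_FORMAT = "org.apache.kudu.spark.kudu"


def append_to_kudu(df, kudu_master, kudu_table):
    """Append a DataFrame to a Kudu table through the DataFrame writer."""
    (df.write
       .format(KUDU_FORMAT)
       .option("kudu.master", kudu_master)
       .option("kudu.table", kudu_table)
       .mode("append")
       .save())


def make_rdd_handler(sql_context, schema, kudu_master, kudu_table):
    """Build a function for DStream.foreachRDD that never calls collect()."""
    def handle(rdd):
        # Skip empty micro-batches so no empty write is issued.
        if rdd.isEmpty():
            return
        # createDataFrame accepts an RDD, so the rows stay on the executors.
        df = sql_context.createDataFrame(rdd, schema)
        append_to_kudu(df, kudu_master, kudu_table)
    return handle
```

It could be wired up roughly as stream.map(parse).foreachRDD(make_rdd_handler(sqlContext, schema, "kudu-master:7051", "my_table")), where parse turns each Kafka record into a tuple or Row matching the schema.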

Related

Building a service with spark and spark streaming

I have read a bit about Spark Streaming and would like to know whether it is possible to stream data from a custom source, with RabbitMQ as the broker, and feed it through a Spark stream where Spark's machine learning and graph processing algorithms are applied to it, and then send the results to other filesystems, databases, dashboards or custom receivers.
P.S. I code in Python and have no experience with Spark. Also, can I call what I'm trying to achieve a microservice?
Thank you.
I feel Spark Structured Streaming is more suitable and easier to implement than Spark Streaming. Spark Structured Streaming follows this concept:
Source (read from RabbitMQ) -- Transformation (apply ML algorithm) -- Sink
(write to database)
You can refer to this GitHub project for an example of Spark Structured Streaming.
I don't think there is a built-in Spark connector that can consume from RabbitMQ. I know there is one for Kafka, but you can write your own custom source and sink (writing this without any Spark knowledge might be tricky).
You can run this as a Spark job, and you then have to create a wrapper service layer that triggers it as a Spark job (a Spark job launcher) or use the Spark REST API.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Is it possible to send data from NiFi to Spark Structured Streaming/Storm directly without data loss?

In my current scenario, NiFi collects data and sends it to Kafka. A streaming engine then consumes the data from Kafka and analyzes it. I don't want to use Kafka between NiFi and the streaming engine, so I want to send data from NiFi to the streaming engine directly. But I don't know some of the details here.
Take Spark Structured Streaming, for example. Assume that I send data from NiFi to Spark Structured Streaming directly and Spark has received the data, but then the Spark node goes down. What happens to the data on that Spark node? (Does Spark Structured Streaming have a NiFi receiver?) Also, in this case, what is the data guarantee in Spark Structured Streaming?
Take Storm, for example. Storm has a NiFi bolt. But assume that Storm has received data from NiFi and then the node goes down. What happens to the data? Also, in this case, what is the data guarantee in Storm?
In short, I want to send data from NiFi to Spark Structured Streaming/Storm directly (I'm more likely to use Spark). But if any node in the streaming engine cluster goes down, I don't want to lose data.
Is this possible for Spark Structured Streaming?
All of the streaming integration with NiFi is done using the site-to-site protocol, which was originally made for transferring data between two NiFi instances.
As far as I know there are currently integrations with Storm, Spark Streaming, and Flink. I'm not familiar with Spark Structured Streaming, but I would imagine you could build this integration in a similar way to the others.
https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver
https://github.com/apache/nifi/tree/master/nifi-external/nifi-storm-spout
https://github.com/apache/flink/tree/master/flink-connectors/flink-connector-nifi
NiFi is not a replayable source of data though. The data is transferred from NiFi to the streaming system in a transaction to ensure it is not removed from the NiFi side until the destination has confirmed the transaction. However, if something fails in the streaming system after that commit, then the data is no longer in NiFi and it is the streaming system's problem.
I'm not sure why you don't want to use Kafka, but NiFi -> Kafka -> Streaming is a more standard and proven approach.
There is a NiFiReceiver for Spark.
Comparing the implementation with the Apache Spark documentation, this receiver is fault tolerant, as it should replay data that was not passed on.

How to write spark structured streaming data to REST API?

I would like to push the data processed by my Spark Structured Streaming job to a REST API. Can someone share examples of this? I have found a few, but they all relate to Spark Streaming, not Structured Streaming.
I have not heard of a REST API sink for Spark Structured Streaming, but you could write one yourself. Start from org.apache.spark.sql.execution.streaming.Sink.
The easiest option, however, would be to use DataStreamWriter.foreach or foreachBatch (available since 2.4).
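A minimal PySpark sketch of the foreachBatch route, assuming Spark 2.4+ and an endpoint that accepts a JSON array per POST. The helper names, the chunk size, and the URL in the usage note are hypothetical, and the HTTP call uses only the standard library so that nothing extra has to be installed on the executors.

```python
import json
import urllib.request


def post_json(url, payload, timeout=10):
    """POST a JSON-serialisable payload and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def chunked(records, size):
    """Yield lists of at most `size` records from an iterable."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def make_batch_writer(url, chunk_size=100, send=post_json):
    """Build a function with the (batch_df, batch_id) shape foreachBatch expects."""
    def send_partition(rows):
        # Runs on the executors: convert each Row to a dict and POST in chunks.
        records = (row.asDict() for row in rows)
        for chunk in chunked(records, chunk_size):
            send(url, chunk)

    def write_batch(batch_df, batch_id):
        # Each partition posts its own rows, so nothing is collected on the driver.
        batch_df.foreachPartition(send_partition)

    return write_batch
```

It would be attached with something like df.writeStream.foreachBatch(make_batch_writer("http://localhost:8080/events")).start(). Since foreachBatch only provides at-least-once guarantees by default, the receiving endpoint should tolerate duplicate records.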

Save each Kafka message in HDFS using Spark Streaming

I am using Spark Streaming to do analysis. After the analysis I have to save the Kafka messages in HDFS. Each Kafka message is an XML file. I can't use rdd.saveAsTextFile because it saves the whole RDD, whereas each element of the RDD is a Kafka message (an XML file). How can I save each RDD element (file) in HDFS using Spark?
I would go about this a different way. Stream your transformed data back into Kafka, and then use the HDFS connector for Kafka Connect to stream the data to HDFS. Kafka Connect is part of Apache Kafka. The HDFS connector is open source and available standalone or as part of Confluent Platform.
Doing it this way you decouple your processing from writing your data to HDFS, which makes it easier to manage, to troubleshoot, and to scale.

How to load streaming data from Amazon SQS?

I use Spark 2.2.0.
How can I feed an Amazon SQS stream into Spark Structured Streaming using PySpark?
This question tries to answer it for non-structured streaming and for Scala by creating a custom receiver.
Is something similar possible in PySpark?
spark.readStream \
.format("s3-sqs") \
.option("fileFormat", "json") \
.option("queueUrl", ...) \
.schema(...) \
.load()
According to Databricks, the above receiver can be used for the S3-SQS file source. However, how might one approach this for SQS alone?
I tried to work out how to receive messages from AWS-SQS-Receive_Message. However, it was not clear how to send the stream directly to Spark Streaming.
I know nothing about Amazon SQS, but feeding a stream into Spark, as in "how can I feed Amazon SQS stream to spark structured stream using pyspark", is not possible with any external messaging system or data source using Spark Structured Streaming (aka Spark "Streams").
It's the other way round in Spark Structured Streaming: it is Spark that pulls data in at regular intervals (similar to the way the Kafka Consumer API works, where the consumer pulls data in rather than being given it).
In other words, Spark "Streams" is just another consumer of messages from a "queue" in Amazon SQS.
Whenever I'm asked to integrate an external system with Spark "Streams", I start by writing a client for the system using its client/consumer API.
Once I have it, the next step is to develop a custom streaming Source for the external system, e.g. Amazon SQS, using that sample client code.
While developing a custom streaming Source you have to do the following steps:
Write a Scala class that implements the Source trait
Register the Scala class (the custom Source) with Spark SQL using a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file containing the fully-qualified class name, or pass the fully-qualified class name to format
Having a custom streaming source is a two-part development: developing the source (and optionally registering it with Spark SQL), and using it in a Spark Structured Streaming application (in Python) by means of the format method.
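The first step mentioned above, a plain consumer client for the external system, could be sketched in Python as follows. This is only the pull loop, not a Spark Source: it assumes a boto3-style SQS client object is passed in, and the function names, the queue URL, and the handler are hypothetical.

```python
def poll_once(client, queue_url, max_messages=10, wait_seconds=20):
    """Pull one batch of messages from the queue using long polling."""
    response = client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS allows at most 10 per call
        WaitTimeSeconds=wait_seconds,      # long poll, at most 20 seconds
    )
    # The "Messages" key is absent when the queue returned nothing.
    return response.get("Messages", [])


def drain(client, queue_url, handle, max_polls=1):
    """Poll up to `max_polls` times, handing each message body to `handle`.

    A message is deleted only after `handle` returns, so a failure in the
    handler leaves the message on the queue to be delivered again.
    """
    processed = 0
    for _ in range(max_polls):
        for message in poll_once(client, queue_url):
            handle(message["Body"])
            client.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
            )
            processed += 1
    return processed
```

With boto3 installed, this could be driven by drain(boto3.client("sqs"), queue_url, print). The same pull-then-acknowledge loop is what a custom Source would wrap on the Scala side.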

Resources