Spark Structured Streaming over Google Cloud Storage - apache-spark

I am running a few batch Spark pipelines that consume Avro data on Google Cloud Storage. I need to update some pipelines to be closer to real time, and I am wondering whether Spark Structured Streaming can consume files from GCS directly in a streaming way, i.e. whether something like SparkContext.readStream.from(...) can be applied to Avro files that external sources generate continuously under a bucket.
Apache Beam already has things like FileIO.matchAll().continuously(), Watch, and watchForNewFiles that allow Beam pipelines to monitor for new files and read them in a streaming way (thus obviating the need for Pub/Sub or a notification system). Is there something similar for Spark Structured Streaming as well?

Because the GCS connector exposes a Hadoop-Compatible FileSystem (HCFS), "gs://" URIs should be valid targets for SparkSession.readStream (the path is passed to .load(...)).
Avro file handling is implemented by spark-avro. Using it with readStream should work the same way as a regular batch read (e.g., .format("com.databricks.spark.avro"), or .format("avro") with the Avro module that ships with Spark 2.4 and later).
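As a rough illustration of the answer above, here is a minimal PySpark sketch. It assumes Spark 2.4 or later with the Avro module and the GCS connector on the classpath; the function name `read_avro_stream` and its parameters are made up for this example and are not part of any library.

```python
def read_avro_stream(spark, path, schema, avro_format="avro", max_files_per_trigger=100):
    """Build a streaming DataFrame over Avro files that appear under `path`.

    `spark` is an existing SparkSession. `path` may be any Hadoop-compatible
    URI, for example a gs:// prefix when the GCS connector is installed.
    """
    return (
        spark.readStream
        .format(avro_format)   # "avro" assumes the Spark 2.4+ Avro module (assumption)
        .schema(schema)        # by default, streaming file sources need an explicit schema
        .option("maxFilesPerTrigger", max_files_per_trigger)  # cap files per micro-batch
        .load(path)
    )
```

It could be called as `read_avro_stream(spark, "gs://example-bucket/events/", "id LONG, payload STRING")`, where the bucket name and schema are placeholders. The file source discovers new files by listing the directory on each trigger, so no notification system is needed, although listing a very large prefix can become slow.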

Related

Spark-Streaming Checkpoints

I am trying to implement Spark Streaming checkpoints, using GCS as the checkpoint storage. Enabling checkpointing degrades the performance of the job. I am wondering whether checkpoints can be written to a SQL database or some other storage that would be faster than writing to HDFS or GCS.
Spark 3.x (and earlier versions) does not provide native support for checkpointing directly to a SQL database. You have to checkpoint to a file system or a distributed file system such as HDFS, GCS, or S3.
That said, you can write your own custom checkpointing mechanism that stores checkpoints in, and retrieves them from, a different destination.
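For reference, a minimal sketch of how the checkpoint location is set on a query, assuming Structured Streaming rather than the older DStream API. The helper name `start_with_checkpoint` and the Parquet sink are choices made for this example only.

```python
def start_with_checkpoint(stream_df, output_path, checkpoint_path, interval="1 minute"):
    """Start a streaming query whose progress is checkpointed to `checkpoint_path`.

    `stream_df` is a streaming DataFrame. Both paths must point to a
    Hadoop-compatible file system (HDFS, gs://, s3a://, ...).
    """
    return (
        stream_df.writeStream
        .format("parquet")                              # sink chosen for illustration
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)  # offsets and state are stored here
        .trigger(processingTime=interval)               # one micro-batch per interval
        .start()
    )
```

Checkpoint metadata is written once per micro-batch, so a longer trigger interval is one knob that may reduce the checkpointing overhead on object stores. Whether it helps in a given job is something to measure rather than assume.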

Kafka log aggregation and processing

Hi, I am trying to use Kafka as a log aggregation and filtering layer whose output feeds into Splunk, for example.
On the input side, Kafka S3 connectors and other connectors will pull logs from S3 and Amazon Kinesis Data Streams.
What I want to know is whether Spark jobs are necessary for processing or filtering inside the Kafka data pipeline, or whether that can be done with a simple Kafka Streams app. If we have to implement this design for several different logs, what would be an efficient solution? I am looking for a solution that we can replicate across different log streams without major changes each time.
Thank you
Spark (or Flink) can essentially replace Kafka Streams and Kafka Connect for transforming topics and writing to S3.
If you want to write directly to Splunk, there is a Kafka connector written explicitly for that, and you could use any Kafka client to consume and produce the processed data before writing it downstream.
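If Spark is the chosen route, one filtering job can be parameterized and reused per log stream. The sketch below is only an illustration: it assumes the spark-sql-kafka package is available, and the function name `filter_topic` and all of its arguments are hypothetical.

```python
def filter_topic(spark, bootstrap_servers, source_topic, sink_topic, condition, checkpoint_path):
    """Copy records matching `condition` from one Kafka topic to another.

    `condition` is a Spark SQL expression over the string columns `key`
    and `value`, for example "value LIKE '%ERROR%'".
    """
    raw = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", source_topic)
        .load()
    )
    # Kafka delivers binary key/value columns; cast them before filtering.
    filtered = (
        raw.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
        .where(condition)
    )
    return (
        filtered.writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("topic", sink_topic)
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

Each log stream then differs only in its topic names and filter expression, which could come from a configuration file. A Kafka Streams app could be structured the same way; the sketch does not imply that Spark is required for simple filtering.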

Building a service with spark and spark streaming

I have read a bit about Spark Streaming, and I would like to know whether it is possible to stream data from a custom source, with RabbitMQ as the broker, and feed it through a Spark stream where Spark's machine learning and graph processing algorithms are applied to it, then send the results to other filesystems/databases/dashboards or customer receivers.
P.S. I code in Python and do not have any experience with Spark. Also, can I call what I'm trying to achieve a microservice?
Thank you.
I feel Spark Structured Streaming is more suitable and easier to implement than Spark Streaming. Spark Structured Streaming follows the concept below:
Source (read from RabbitMQ) -- Transformation (apply ML algo) -- Sink (write to database)
You can refer to this GitHub project for an example of Spark Structured Streaming.
I don't think there is a built-in Spark connector that can consume from RabbitMQ. I know there is one for Kafka, but you can write your own custom source and sink (writing this without any Spark knowledge might be tricky).
You can start this as a Spark job; you then have to create a wrapper service layer that triggers it as a Spark job (a Spark job launcher), or use the Spark REST API.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
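The Source -- Transformation -- Sink shape can be sketched in PySpark as follows. Since there is no built-in RabbitMQ source, the sketch uses Spark's built-in "rate" test source as a stand-in. The name `build_pipeline` and the `transform` and `write_batch` callables are placeholders for this example.

```python
def build_pipeline(spark, transform, write_batch, checkpoint_path, rows_per_second=5):
    """Wire a source, a transformation, and a sink into one streaming query.

    `transform` maps a streaming DataFrame to a streaming DataFrame (this is
    where a fitted ML model could be applied). `write_batch(batch_df, batch_id)`
    is called once per micro-batch and can write to any database or file system.
    """
    # Source: the "rate" source emits (timestamp, value) rows and stands in
    # for a custom RabbitMQ source here (assumption).
    source = (
        spark.readStream
        .format("rate")
        .option("rowsPerSecond", rows_per_second)
        .load()
    )
    # Transformation
    transformed = transform(source)
    # Sink: foreachBatch (Spark 2.4+) hands each micro-batch to ordinary batch code.
    return (
        transformed.writeStream
        .foreachBatch(write_batch)
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

Using foreachBatch keeps the sink side in plain Python, which may be the easier half to start with. The source side would still need a connector or an intermediate system that Spark can already read from.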

Is it possible to send data from NiFi to Spark Structured Streaming/Storm directly without data loss?

In my current scenario, NiFi collects data and sends it to Kafka. A streaming engine then consumes the data from Kafka and analyzes it. I don't want to use Kafka between NiFi and the streaming engine, so I want to send data from NiFi to the streaming engine directly. However, I am unsure about some details.
Take Spark Structured Streaming, for example: assume that I send data from NiFi to Spark Structured Streaming directly, and Spark receives the data but then the Spark node goes down. What happens to the data on that Spark node? (Does Spark Structured Streaming have a NiFi receiver?) Also, in this case, what data guarantee does Spark Structured Streaming give?
Take Storm, for example: Storm has a NiFi bolt. But assume that Storm has received data from NiFi and then the node goes down. What happens to the data? Also, in this case, what data guarantee does Storm give?
In short, I want to send data from NiFi to Spark Structured Streaming or Storm directly (I am more likely to use Spark). But if any node in the streaming engine cluster goes down, I don't want to lose data.
Is this possible for Spark Structured Streaming?
All of the streaming integration with NiFi is done using the site-to-site protocol, which was originally made for two NiFi instances to transfer data.
As far as I know, there are currently integrations with Storm, Spark Streaming, and Flink. I'm not familiar with Spark Structured Streaming, but I would imagine you could build this integration in a similar way to the others.
https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver
https://github.com/apache/nifi/tree/master/nifi-external/nifi-storm-spout
https://github.com/apache/flink/tree/master/flink-connectors/flink-connector-nifi
NiFi is not a replayable source of data though. The data is transferred from NiFi to the streaming system in a transaction to ensure it is not removed from the NiFi side until the destination has confirmed the transaction. However, if something fails in the streaming system after that commit, then the data is no longer in NiFi and it is the streaming system's problem.
I'm not sure why you don't want to use Kafka, but NiFi -> Kafka -> Streaming is a more standard and proven approach.
There is a NiFiReceiver for Spark.
Comparing the implementation with the Apache Spark documentation, this receiver is fault tolerant, as it should replay data that was not passed on.

How to load streaming data from Amazon SQS?

I use Spark 2.2.0.
How can I feed an Amazon SQS stream into Spark Structured Streaming using PySpark?
This question tries to answer it for non-structured streaming and for Scala by creating a custom receiver.
Is something similar possible in pyspark?
spark.readStream \
.format("s3-sqs") \
.option("fileFormat", "json") \
.option("queueUrl", ...) \
.schema(...) \
.load()
According to Databricks, the source above can be used for the S3-SQS file source. However, how would one approach this for SQS alone?
I tried to work out how to receive messages from AWS-SQS-Receive_Message. However, how to send the stream directly to Spark Streaming was not clear.
I know nothing about Amazon SQS, but "how can I feed Amazon SQS stream to spark structured stream using pyspark" describes something that is not possible with any external messaging system or data source in Spark Structured Streaming (aka Spark "Streams").
It works the other way round: in Spark Structured Streaming it is Spark that pulls data in at regular intervals (similar to the way the Kafka Consumer API works, where the consumer pulls data in rather than being given it).
In other words, Spark "Streams" is just another consumer of messages from a "queue" in Amazon SQS.
Whenever I'm asked to integrate an external system with Spark "Streams", I start by writing a client for the system using its client/consumer API.
Once I have it, the next step is to develop a custom streaming Source for the external system, e.g. Amazon SQS, using the sample client code above.
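A client of that kind might look like the sketch below. It assumes a boto3-style SQS client (for example one created with `boto3.client("sqs")`), whose `receive_message` and `delete_message` calls are used. The function name `poll_sqs` and its defaults are made up for this example.

```python
def poll_sqs(client, queue_url, max_messages=10, wait_seconds=20):
    """Pull one batch of message bodies from an SQS queue.

    `client` is assumed to behave like a boto3 SQS client. Messages are
    deleted right after they are read, so this sketch gives at-most-once
    delivery; a real source would delete only after Spark commits the batch.
    """
    response = client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS allows at most 10 per call
        WaitTimeSeconds=wait_seconds,      # long polling, at most 20 seconds
    )
    bodies = []
    # The "Messages" key is absent when the queue returned nothing.
    for message in response.get("Messages", []):
        bodies.append(message["Body"])
        client.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
    return bodies
```

The pull-based shape matches how a streaming Source works: Spark asks for the next batch, and the source fetches whatever is available.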
While developing a custom streaming Source you have to do the following steps:
Write a Scala class that implements the Source trait
Register the Scala class (the custom Source) with Spark SQL using a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file that contains the fully-qualified class name, or use the fully-qualified class name directly in format
Having a custom streaming source is therefore a two-part development: developing the source (and optionally registering it with Spark SQL), and using it in a Spark Structured Streaming application (in Python) by means of the format method.
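The Python half of that work is small. The sketch below assumes the jar containing the custom source is already on the classpath (for example via spark-submit's `--jars`). The helper name `read_custom_source` and the class name in the usage note are hypothetical.

```python
def read_custom_source(spark, source_name, options):
    """Open a stream from a custom source implemented on the JVM side.

    `source_name` is either the short alias registered through
    DataSourceRegister or the fully-qualified class name of the source.
    `options` is a dict of source-specific settings passed through as-is.
    """
    return (
        spark.readStream
        .format(source_name)
        .options(**options)
        .load()
    )
```

A call might look like `read_custom_source(spark, "com.example.sqs.SqsSourceProvider", {"queueUrl": "..."})`, where the class name and option key are placeholders that depend entirely on how the Scala source is written.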

Resources