How to read from InputStream? - apache-spark

I have an InputStream coming from the source and want to read the data using Spark Streaming.
How to do that?

My very rough understanding of Spark Streaming says to use ssc.receiverStream with a custom Receiver.
Consult Spark Streaming Custom Receivers.

The Spark Streaming API provides various methods for reading data from different sources. As of Spark 1.5.2, it provides utility functions to read streaming data from:
Binary or Text Files or Directories
Raw Sockets
Queues
For any other data source, you need to provide a custom implementation of a streaming Receiver and pass it to the following method of StreamingContext:
StreamingContext.receiverStream[T](receiver: Receiver[T])
Details about implementing custom Streaming receivers can be found here.

Related

Spark structured streaming from JDBC source

Can someone let me know if it's possible to do Spark Structured Streaming from a JDBC source, e.g. a SQL DB or any RDBMS?
I have looked at a few similar questions on SO, e.g.:
Spark streaming jdbc read the stream as and when data comes - Data source jdbc does not support streamed reading
jdbc source and spark structured streaming
However, I would like to know if it's officially supported in Apache Spark.
If there is any sample code that would be helpful.
Thanks
No, there is no such built-in support in Spark Structured Streaming. The main reason is that most databases don't provide a unified interface for obtaining changes.
It's possible to get changes from some databases using archive logs, write-ahead logs, etc., but that is database-specific. For many databases the popular choice is Debezium, which can read such logs and push the list of changes into Kafka, or something similar, from which it can be consumed by Spark.
I am currently on a project architecting this using CDC SharePlex from Oracle, writing to Kafka, and then using Spark Structured Streaming with the Kafka integration and MERGE on Delta format on HDFS.
I.e., that is the way to do it if not using Debezium. You can use change logs for base tables or materialized views to feed CDC.
So direct JDBC is not possible.
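A minimal PySpark sketch of the Kafka route described above, assuming the change events already land in a Kafka topic. The broker address and topic name passed in are placeholders, not values from the answers, and the job is assumed to run with the `spark-sql-kafka-0-10` package available.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Options for Spark's built-in Kafka source (values are placeholders)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",
    }


def read_change_events(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame holding each change event as a JSON string."""
    reader = spark.readStream.format("kafka")
    for key, value in kafka_source_options(bootstrap_servers, topic).items():
        reader = reader.option(key, value)
    # Kafka delivers the value as binary; cast it to a string for later parsing.
    return reader.load().selectExpr("CAST(value AS STRING) AS change_json")
```

A call might look like `read_change_events(spark, "localhost:9092", "cdc.customers")`, where both arguments are hypothetical. Parsing `change_json` depends on the format the CDC tool emits.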

Building a service with spark and spark streaming

I have read a bit about Spark Streaming and would like to know if it's possible to stream data from a custom source, with RabbitMQ as a broker, and feed this data through a Spark stream where Spark's machine learning and graph processing algorithms are applied to it, then send the results to other filesystems/databases/dashboards or customer receivers.
P.S. I code in Python and do not have any experience using Spark. Also, can I call what I'm trying to achieve a microservice?
Thank you.
I feel Spark Structured Streaming is more suitable and easier to implement than Spark Streaming. Spark Structured Streaming follows the concept below:
Source (read from RabbitMQ) -- Transformation (apply ML algo) -- Sink (write to database)
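The source, transformation, sink concept above can be sketched in PySpark as follows. Since there is no built-in RabbitMQ connector, this sketch assumes the built-in `rate` test source as a stand-in for the source, a trivial column expression where an ML model would be applied, and the console as a stand-in for the database sink.

```python
def build_pipeline(spark):
    """Source -> transformation -> sink, using built-in stand-ins."""
    # Source: the built-in "rate" test source stands in for RabbitMQ here.
    events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    # Transformation: a trivial expression where a model would score each row.
    scored = events.selectExpr("timestamp", "value", "value % 2 = 0 AS is_even")
    # Sink: print each micro-batch to the console instead of writing to a database.
    return scored.writeStream.format("console").outputMode("append").start()
```

`build_pipeline(spark)` returns the running query; calling `awaitTermination()` on it keeps the job alive.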
You can refer to this GitHub project for an example of Spark Structured Streaming.
I don't think there is a built-in Spark connector that can consume from RabbitMQ. I know there is one for Kafka, but you can write your own custom source and sink (writing this without any Spark knowledge might be tricky).
You can start this as a Spark job, and you have to create a wrapper service layer which triggers it as a Spark job (a Spark job launcher) or use the Spark REST API.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Reading a Message from Kafka and writing to HDFS

I'm looking for the best way to read messages (a lot of messages, around 100B each day) from Kafka. After reading a message I need to manipulate the data and write it into HDFS.
If I need to do it with the best performance, what is the best way for me to read messages from Kafka and write files into HDFS?
Which programming language is best for that?
Should I consider using solutions like Spark for that?
You should use Spark Streaming for this (see here). It provides a simple correspondence between Kafka partitions and Spark partitions.
Or you can use Kafka Streams (see more). Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in Kafka clusters.
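For the Spark route, the HDFS side might look like the sketch below. It uses Structured Streaming's file sink rather than the DStream API the answer links to, and both paths are hypothetical placeholders.

```python
def write_to_hdfs(stream_df, output_path, checkpoint_path):
    """Continuously append a streaming DataFrame to HDFS as Parquet files."""
    return (
        stream_df.writeStream
        .format("parquet")
        .option("path", output_path)
        # The file sink requires a checkpoint location to track progress.
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .start()
    )
```

A call might look like `write_to_hdfs(df, "hdfs:///data/events", "hdfs:///checkpoints/events")`, where `df` is a streaming DataFrame read from Kafka.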
You can use Spark, Flink, NiFi, Streamsets... but Confluent provides Kafka Connect HDFS exactly for this purpose.
The Kafka Connect API is somewhat limited in transformations, so what most people do is write a Kafka Streams job to filter/enhance the data into a secondary topic, which is then written to HDFS.
Note: These options will write many files to HDFS (generally, one per Kafka topic partition).
Which programming language is best for that?
Each of the above uses Java, but you don't need to write any code yourself if using NiFi, Streamsets, or Kafka Connect.

Sending Spark streaming metrics to open tsdb

How can I send metrics from my Spark Streaming job to an OpenTSDB database? I am trying to use OpenTSDB as a data source in Grafana. Can you please help me with some references where I can start?
I do see an OpenTSDB reporter here which does a similar job. How can I integrate the metrics from a Spark Streaming job to use this? Are there any easy options to do it?
One way to send the metrics to OpenTSDB is to use its REST API. To use it, simply convert the metrics to JSON strings and then use the Apache HTTP Client library to send the data (it's in Java and can therefore be used in Scala). Example code can be found on GitHub.
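A minimal sketch of the REST approach, written in Python with the standard library instead of the Apache HTTP Client the answer mentions. It assumes OpenTSDB's `/api/put` endpoint; the base URL, metric name, and tags are placeholders chosen for illustration.

```python
import json
import time
import urllib.request


def to_datapoint(metric, value, tags, timestamp=None):
    """Build one data point in the shape OpenTSDB's /api/put endpoint accepts."""
    return {
        "metric": metric,
        "timestamp": int(time.time()) if timestamp is None else int(timestamp),
        "value": value,
        "tags": dict(tags),  # OpenTSDB expects at least one tag per data point
    }


def send_datapoints(base_url, datapoints, timeout=5):
    """POST a batch of data points as JSON; returns the HTTP status code."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/put",
        data=json.dumps(datapoints).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

For example, `send_datapoints("http://localhost:4242", [to_datapoint("spark.batch.delay_ms", 120, {"app": "my-job"})])` would send a single hypothetical metric.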
A more elegant solution would be to use the Spark metrics library and add a sink to the database. There has been a discussion on adding an OpenTSDB sink to the Spark metrics library; however, in the end it was not added to Spark itself. The code is available on GitHub and should be possible to use. Unfortunately, the code targets Spark 1.4.1; however, in the worst case it should still give some indication of what needs to be added.

How to load streaming data from Amazon SQS?

I use Spark 2.2.0.
How can I feed an Amazon SQS stream to a Spark structured stream using PySpark?
This question tries to answer it for non-structured streaming and for Scala by creating a custom receiver.
Is something similar possible in PySpark?
spark.readStream \
    .format("s3-sqs") \
    .option("fileFormat", "json") \
    .option("queueUrl", ...) \
    .schema(...) \
    .load()
According to Databricks, the above can be used for the S3-SQS file source. However, how might one approach this for SQS alone?
I tried to understand how to receive messages from AWS-SQS-Receive_Message. However, how to send the stream directly to Spark Streaming was not clear.
I know nothing about Amazon SQS, but "how can I feed Amazon SQS stream to spark structured stream using pyspark" is not possible with any external messaging system or data source using Spark Structured Streaming (aka Spark "Streams").
It's the other way round in Spark Structured Streaming: it is Spark that pulls data in at regular intervals (similar to the way the Kafka Consumer API works, where the consumer pulls data in rather than being handed it).
In other words, Spark "Streams" is just another consumer of messages from a "queue" in Amazon SQS.
Whenever I'm asked to integrate an external system with Spark "Streams" I start writing a client for the system using the client/consumer API.
Once I have it, the next step is to develop a custom streaming Source for the external system, e.g. Amazon SQS, using the sample client code above.
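As an illustration of the first step (a plain client for the external system), here is a sketch of one polling round against SQS. It assumes `sqs_client` behaves like `boto3.client("sqs")`; the function name and the choice to delete messages immediately after reading them are assumptions of this sketch, not part of the answer.

```python
def poll_messages(sqs_client, queue_url, max_messages=10, wait_seconds=20):
    """Pull one batch of message bodies from SQS and delete what was read.

    `sqs_client` is expected to behave like boto3.client("sqs").
    """
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS allows at most 10 per call
        WaitTimeSeconds=wait_seconds,      # long polling, at most 20 seconds
    )
    bodies = []
    # The "Messages" key is absent when the queue returned nothing.
    for message in response.get("Messages", []):
        bodies.append(message["Body"])
        sqs_client.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
    return bodies
```

Deleting right after reading means a message is lost if later processing fails; a real source would more likely delete only after the batch is committed.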
While developing a custom streaming Source you have to do the following steps:
Write a Scala class that implements the Source trait
Register the Scala class (the custom Source) with Spark SQL using the META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file with the fully-qualified class name, or use the fully-qualified class name in format.
Working with a custom streaming source is a two-part effort: developing the source (and optionally registering it with Spark SQL), and using it in a Spark Structured Streaming application (in Python) by means of the format method.
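The second part, using the source from Python, could look like the sketch below. The helper name, the class name in the usage note, and the option keys are hypothetical; they depend entirely on how the custom source is written.

```python
def read_custom_source(spark, source_name, options):
    """Open a stream from a custom source.

    `source_name` is either the short name registered through
    DataSourceRegister or the fully-qualified class name of the source.
    """
    return spark.readStream.format(source_name).options(**options).load()
```

A call might look like `read_custom_source(spark, "com.example.sqs.SqsSourceProvider", {"queueUrl": "..."})`, assuming the JAR containing that hypothetical class is on the Spark classpath.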

Resources