Why there is no JDBC Spark Streaming receiver? - apache-spark

I suggest it's a good idea to process huge JDBC table by reading rows by batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. I suppose no monitoring of new rows in the table, but just reading the table once.
I was surprised that there is no JDBC Spark Streaming receiver implementation. Implementing Receiver doesn't look difficult.
Could you describe why such receiver doesn't exist (is this approach a bad idea?) or provide links to implementations.
I've found Stratio/datasource-receiver. But it reads all data in a DataFrame before processing by Spark Streaming.
Thanks!

First of all actual streaming source would require a reliable mechanism for monitoring updates, which is simply not a part of JDBC interface nor it is a standardized (if at all) feature of major RDBMs, not to mention other platforms, which can be accessed through JDBC. It means that streaming from a source like this typically requires using log replication or similar facilities and is highly resource dependent.
At the same what you describe
suggest it's a good idea to process huge JDBC table by reading rows by batches and processing them with Spark Streaming. This approach doesn't require reading all rows into memory. I suppose no monitoring of new rows in the table, but just reading the table once
is really not an use case for streaming. Streaming deals with infinite streams of data, while you ask is simply as scenario for partitioning and such capabilities are already a part of the standard JDBC connector (either by range or by predicate).
Additionally receiver based solutions simply don't scale well and effectively model a sequential process. As a result their applications are fairly limited, and wouldn't be even less appealing if data was bounded (if you're going to read finite data sequentially on a single node, there is no value in adding Spark to the equation).

I don't think it is a bad idea since in some cases you have constraints that are outside your power,e.g. legacy systems to which you cannot apply strategies such as CDC but to which you still have to consume as a source of stream data.
On the other hand, Spark Structure Streaming engine, in micro-batch mode, requires the definition of an offset than can be advanced, as you can see in this class. So, if your table has some column that can be used as an offset, you can definitely stream from it, although RDMDS are not the "streaming-friendly" as far as I know.
I have developed Jdbc2s which is a DataSource V1 streaming source for Spark. It's also deployed to Maven Central, if you need. Coordinates are in the documentation.

Related

Is windowing based on event time possible with Spark Streaming?

According to the Dataflow Model paper : A practical approach to balancing correctness, latency and cost in massive-scale, unbounded, out-of-order Data processing:
MillWheel and Spark Streaming are both sufficiently scalable,
fault-tolerant, and low-latency to act as reasonable substrates, but
lack high-level programming models that make calculating event-time
sessions straightforward.
Is it always the case?
No, it is not.
To quote from https://dzone.com/articles/spark-streaming-vs-structured-streaming so as to save on my lunch time!:
One big issue in the streaming world is how to process data according
to event-time.
Event-time is the time when the event actually happened. It is not
necessary for the source of the streaming engine to prove data in
real-time. There may be latencies in data generation and handing over
the data to the processing engine. There is no such option in Spark
Streaming to work on the data using the event-time. It only works with
the timestamp when the data is received by the Spark. Based on the
ingestion timestamp, Spark Streaming puts the data in a batch even if
the event is generated early and belonged to the earlier batch, which
may result in less accurate information as it is equal to the data
loss.
On the other hand, Structured Streaming provides the functionality to
process data on the basis of event-time when the timestamp of the
event is included in the data received. This is a major feature
introduced in Structured Streaming which provides a different way of
processing the data according to the time of data generation in the
real world. With this, we can handle data coming in late and get more
accurate results.
With event-time handling of late data, Structured Streaming outweighs
Spark Streaming.

Spark Ingestion path: "Source to Driver to Worker" or "Source to Workers"

When Spark ingest the Data, is there specific situation where it has to go trough the driver and then from the driver the worker ? Same question apply for a direct read by the worker.
I guess i am simply trying to map out what are the condition or situation that lead to one way or the other, and how does partitioning happen in each case.
If you limit yourself to built-in methods then unless you create distributed data structure from a local one with method like:
SparkSession.createDataset
SparkContext.parallelize
data is always accessed directly by the workers, but the details of the data distribution will vary from source to source.
RDDs typically depend on Hadoop input formats, but Spark SQL and data source API, are at least partially independent, at least when it comes to configuration,
It doesn't mean data is always properly distributed. In some cases (JDBC, streaming receivers) data may still be piped trough a single node.

Use Akka with apache spark streaming & Kafka?

Below is the high level usecase which im trying to workon.
we have stream of students data published into a Kafka topic and our module has to read the student ids as stream and fetch associated data from multiple sources for each student and perform some calculation for each student and publish the associated calculation for each student into a kafka topic.
So here the question is it better to write a single big Spark job or use Akka to have separate service for each source so that actors can work parallely take bunch of student ids and get the data from respective source and perform some bunch Transformations and actions and finally a calculation associated with each student .
Or do i really need to use Akka here? Will Spark handles this efficiently internally?
Appreciate any thoughts here.
If your transformations take data from Kafka as input and produce output back into Kafka, it appears the most natural fit is Kafka Streams. I'd look to that first. Kafka Streams take advantage of the partitioning of data on Kafka to process partition groups in parallel to each other, but process messages sequentially within in each group, similarly how akka actors work in parallel to each other but each actor internally processes messages sequentially.
However, if your calculation requires e.g. machine learning or in general some iterative data-processing which does re-partitioning (shuffling in spark lingo) of the data between iterations, then Kafka Streams would no longer be that good a fit, I think. Then I'd consider Spark or Flink.
Akka is really powerful and you can use it in both these cases and more. However, it's a lower level library than Kafka Streams, Spark or Flink. Which means you have more power but also more considerations to think about. If using akka, I'd go for akka-streams. They have a good integration with kafka via the akka-stream-kafka (aka reactive-kafka) library.

What is the best way to store incoming streaming data?

What is a better choice for a long-term store (many writes, few reads) of data processed through Spark Streaming: Parquet, HBase or Cassandra? Or something else? What are the trade-offs?
In my experience we have used Hbase as datastore for spark streaming data(we also has same scenario many writes and few reads), since we are using hadoop, hbase has native integration with hadoop and it went well..
Above we have used tostore hight rate of messages coming over from solace.
HBase is well suited for doing Range based scans. Casandra is known for availablity and many other things...
However, I can also observe one general trend in many projects, they are simply storing rawdata in hdfs (parquet + avro) in partitioned structure through spark streaming with spark dataframe(SaveMode.Append) and they are processing rawdata with Spark
Ex of partitioned structure in hdfs :
completion ofbusinessdate/environment/businesssubtype/message type etc....
in this case there is no need for going to Hbase or any other data store.
But one common issue in above approach is when you are getting small and tiny files, through streaming then you would need to repartion(1) or colelese or FileUtils.copymerge to meet block size requirements to single partitioned file. Apart from that above approach also would be fine.
Here is some thing called CAP theorm based on which decision can be taken.
Consistency (all nodes see the same data at the same time).
Availability (every request receives a response about whether it
succeeded or failed).
Partition tolerance (the system continues to
operate despite arbitrary partitioning due to network failures)
Casandra supports AP.
Hbase supports CP.
Look at detailed analysis given here

Spark Streaming with large number of streams and models used for analytical processing of RDDs

We are creating a real-time stream processing system with spark streaming which uses large number (millions) of analytic models applied to RDDs in the many different type of incoming metric data streams(more then 100000). This streams are original or transformed streams. Each RDD has to go through an analytical model for processing. Since we do not know which spark cluster node will process which specific RDDs from different streams, we need to make ALL these models available at each Spark compute node. This will create huge overhead at each spark node. We are considering using in-memory data grids to provide these models at spark compute nodes. Is this the right approach?
Or
Should we avoid using Spark streaming all together and just use in-memory data grids like Redis(with pub/sub) to solve this problem. In that case we will stream data to specific Redis nodes which contain the specific models. of course we will have to do all binning/window etc..
Please suggest.
Sounds like to me like you need a combination of stream processing engine and a distributed data store. I would design the system like this.
The distributed datastore (Redis, Cassandra, etc.) can have the data you want to access from all the nodes.
Receive the data streams through a combination data ingestion system (Kafka, Flume, ZeroMQ, etc.) and process it in the stream processing system (Spark Streaming [preferably ;)], Storm, etc.).
In the functions that is used to process the stream records, the necessary data will have to pulled from the data store and maybe cached locally as appropriate.
You may also have to update the data store from spark streaming as application needs it. In which case you will also have to worry about versioning of the data that you want pull in step 3.
Hopefully that made sense. Its hard to give any more specifics of the implementation without the exactly computation model. Hope this helps!

Resources