How to use a Twitter stream source in Hazelcast Jet without needing a DAG? - hazelcast-jet

I want to do simple analysis on a live stream of tweets.
How do you use a Twitter stream source in Hazelcast Jet without needing a DAG?
Details
The Twitter API is nicely encapsulated in StreamTwitterP.java.
However, the caller uses it as part of a DAG:
Vertex twitterSource = dag.newVertex("twitter",
        StreamTwitterP.streamTwitterP(properties, terms));
My use case doesn't need the power of a DAG, so I'd rather avoid that needless extra complexity.
To avoid a DAG, I'm looking to use SourceBuilder to define a new data source for a live stream of tweets.
I assume that would use code similar to the StreamTwitterP.java mentioned above, but it's not clear to me how it fits with the Hazelcast Jet API.
I was referring to the SourceBuilder example from the docs.

You can convert a processor to a pipeline source:
Pipeline p = Pipeline.create();
p.drawFrom(Sources.<String>streamFromProcessor("twitter",
streamTwitterP(properties, terms)))
...
There's also a twitterSource version that uses SourceBuilder here.
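For reference, here is a rough sketch of what a SourceBuilder-based Twitter source could look like. It is only an outline: TwitterFeed is a hypothetical wrapper around the same Twitter client that StreamTwitterP uses (assumed to push raw tweet JSON into a queue), and the drawFrom/drainTo calls assume a Jet version matching the snippet above.

import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.SourceBuilder;
import com.hazelcast.jet.pipeline.StreamSource;

public class TwitterSourceSketch {

    // Hypothetical wrapper around the Twitter client used by StreamTwitterP;
    // the client is expected to offer() each raw tweet JSON string into `queue`.
    static class TwitterFeed {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

        TwitterFeed(Properties properties, List<String> terms) {
            // connect the Twitter client here
        }

        void close() {
            // disconnect the Twitter client here
        }
    }

    static StreamSource<String> twitterSource(Properties properties, List<String> terms) {
        return SourceBuilder
                .stream("twitter", ctx -> new TwitterFeed(properties, terms))
                .<String>fillBufferFn((feed, buf) -> {
                    // drain whatever arrived since the last call; never block here
                    for (String tweet; (tweet = feed.queue.poll()) != null; ) {
                        buf.add(tweet);
                    }
                })
                .destroyFn(TwitterFeed::close)
                .build();
    }

    public static void main(String[] args) {
        Properties properties = new Properties();               // Twitter credentials
        List<String> terms = Arrays.asList("hazelcast", "jet"); // tracked terms

        Pipeline p = Pipeline.create();
        p.drawFrom(twitterSource(properties, terms))
         .drainTo(Sinks.logger());
    }
}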

Related

Can Kafka-Spark Streaming pair be used for both batch+real time data?

Hi All,
I am currently working on developing an architecture which should be able to handle both real-time and batch data (coming from disparate sources and point solutions - third-party tools). The existing architecture is old school and uses mostly RDBMS (I am not going to go into detail on that).
What I have come up with is two different pipelines - one for batch data (Sqoop/Spark/Hive) and the other for real-time data (Kafka-Spark Streaming).
But I have been told to use the Kafka-Spark Streaming pair for handling all kinds of data.
If anyone has experience working with the Kafka-Spark Streaming pair for handling all kinds of data, could you please give me brief details on whether this could be a viable solution and better than having two different pipelines?
Thanks in advance!
What I have come up with is two different pipelines - one for batch data (Sqoop/Spark/Hive) and the other for real-time data (Kafka-Spark Streaming).
Pipeline 1: Sqoop is a good choice for batch load, but it will be slow because the underlying engine is still MapReduce. There are options to run Sqoop on Spark, but I haven't tried them. Once the data is in HDFS you can use Hive, which is a great solution for batch processing. Having said that, you can replace Sqoop with Spark if you are worried about the RDBMS fetch time, and you can do batch transformations in Spark as well. I would say this is a good solution.
Pipeline 2: Kafka and Spark Streaming are the most obvious choice and a good one. But if you are using the Confluent distribution of Kafka, you could replace most of the Spark transformations with KSQL or Kafka Streams, which do those transformations in real time.
I would say it's good to have one system for batching and one for real time; this is what the Lambda architecture is. But if you are looking for a more unified framework, you can try Apache Beam, which provides a unified programming model for both batch and realtime processing; you can choose from multiple runners to execute your pipeline.
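As a rough illustration of that unified model (a sketch, not a recommendation), the same transform below can count records from a bounded batch source or, with the source swapped for KafkaIO, from an unbounded stream; the HDFS path is a placeholder.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.values.PCollection;

public class UnifiedCount {

    // The same transform chain is applied to bounded (batch) and unbounded
    // (streaming) PCollections; for unbounded input you would first apply a
    // Window transform before a global combine like this.
    static PCollection<Long> countRecords(PCollection<String> records) {
        return records.apply(Count.globally());
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Batch case: read files from HDFS (placeholder path).
        PCollection<String> batchInput = p.apply(TextIO.read().from("hdfs:///data/events/*"));
        countRecords(batchInput);

        // Streaming case: swap the source for KafkaIO.read() (beam-sdks-java-io-kafka)
        // and run on a streaming-capable runner (Flink, Spark, Dataflow, ...).

        p.run().waitUntilFinish();
    }
}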
Hope this helps :)
Lambda architecture would be the way to go!
Hope this link gives you enough ideas:
https://dzone.com/articles/lambda-architecture-how-to-build-a-big-data-pipeli
Thanks much.

Transparent Streaming & Batch processing

I'm still quite new to the world of stream and batch processing and trying to understand the concepts and terminology. It is admittedly very possible that the answer to my question is well known, easy to find, or even answered a hundred times here on SO, but I was not able to find it.
The background:
I am working on a big scientific project (nuclear fusion research), and we are producing tons of measurement data during experiment runs. Those data are mostly streams of samples tagged with a nanosecond timestamp, where a sample can be anything from a single ADC value, via an array of such values, via deeply structured data (with up to hundreds of entries, from 1-bit booleans to 64-bit double-precision floats), to raw HD video frames or even plain text messages. If I understand the common terminology right, I would regard our data as "tabular data", for the most part.
We are working with mostly self-made software solutions, from data acquisition over simple online (streaming) analysis (like scaling, subsampling and such) to our own data storage, management and access facilities.
In view of the scale of the operation and the effort of maintaining all those implementations, we are investigating the possibility of using standard frameworks and tools for more of our tasks.
My question:
In particular, at this stage we are facing the need for more and more sophisticated (automated and manual) data analytics on live/online/realtime data, as well as "after the fact" offline/batch analytics of "historic" data. In this endeavor, I am trying to understand if and how existing analytics frameworks like Spark, Flink, Storm etc. (possibly supported by message queues like Kafka, Pulsar, ...) can support a scenario where
data is flowing/streamed into the platform/framework, tagged with an identifier like a URL or an ID or such
the platform interacts with integrated or external storage to persist the streaming data (for years), associated with the identifier
analytics processes can now transparently query/analyse data addressed by an identifier and an arbitrary (open or closed) time window, and the framework supplies data batches/samples for the analysis either from backend storage or coming in live from data acquisition
Simply streaming the online data into storage and querying from there is not an option, as we need both raw and analysed data for live monitoring and realtime feedback control of the experiment.
Also, having the user query a live input signal differently from a historic batch in storage would not be ideal; our physicists mostly are not data scientists, we would like to keep such "technicalities" away from them, and ideally the exact same algorithms should be used for analysing new realtime data and old stored data from previous experiments.
Side notes:
we are talking about peak data loads in the range of tens of gigabits per second, coming in bursts of increasing length from seconds up to minutes - could this be handled by the candidates?
we are using timestamps with nanosecond resolution, and are even thinking about picoseconds - this poses some limitations on the list of possible candidates, if I understand correctly?
I would be very grateful if anyone were able to understand my question and shed some light on the topic for me :-)
Many Thanks and kind regards,
Beppo
I don't think anyone can say "yes, framework X can definitely handle your workload", because it depends a lot on what you need out of your message processing, e.g. regarding messaging reliability, and how your data streams can be partitioned.
You may be interested in Benchmarking Distributed Stream Processing Engines. The paper uses versions of Storm/Flink/Spark that are a few years old (it looks like they were released in 2016), but maybe the authors would be willing to let you use their benchmark to evaluate newer versions of the three frameworks?
A very common setup for streaming analytics is to go data source -> Kafka/Pulsar -> analytics framework -> long term data store. This decouples processing from data ingest, and lets you do stuff like reprocessing historical data as if it were new.
I think the first step for you should be to see if you can get the data volume you need through Kafka/Pulsar. Either generate a test set manually, or grab some data you think could be representative from your production environment, and see if you can put it through Kafka/Pulsar at the throughput/latency you need.
Remember to consider partitioning of your data. If some of your data streams could be processed independently (i.e. ordering doesn't matter), you should not be putting them in the same partitions. For example, there is probably no reason to mix sensor measurements and the video feed streams. If you can separate your data into independent streams, you are less likely to run into bottlenecks both in Kafka/Pulsar and the analytics framework. Separate data streams would also allow you to parallelize processing in the analytics framework much better, as you could run e.g. video feed and sensor processing on different machines.
Once you know whether you can get enough throughput through Kafka/Pulsar, you should write a small example for each of the 3 frameworks. To start, I would just receive and drop the data from Kafka/Pulsar, which should let you know early whether there's a bottleneck in the Kafka/Pulsar -> analytics path. After that, you can extend the example to do something interesting with the example data, e.g. do a bit of processing like what you might want to do in production.
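For the receive-and-drop baseline, a plain Kafka consumer like the sketch below is enough to get a first throughput number before involving any analytics framework; the broker address, topic name and group id are placeholders.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DrainTopic {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "throughput-test");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        long count = 0;
        long reported = 0;
        long start = System.nanoTime();
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sensor-data"));
            while (true) {
                // receive and immediately drop the records, only counting them
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                count += records.count();
                if (count - reported >= 1_000_000) {
                    double secs = (System.nanoTime() - start) / 1e9;
                    System.out.printf("%,d records, %.0f records/s%n", count, count / secs);
                    reported = count;
                }
            }
        }
    }
}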
You also need to consider which kinds of processing guarantees you need for your data streams. Generally you will pay a performance penalty for guaranteeing at-least-once or exactly-once processing. For some types of data (e.g. the video feed), it might be okay to occasionally lose messages. Once you decide on a needed guarantee, you can configure the analytics frameworks appropriately (e.g. disable acking in Storm), and try benchmarking on your test data.
Just to answer some of your questions more explicitly:
The live data analysis/monitoring use case sounds like it fits the Storm/Flink systems fairly well. Hooking it up to Kafka/Pulsar directly, and then doing whatever analytics you need sounds like it could work for you.
Reprocessing of historical data is going to depend on what kind of queries you need to do. If you simply need a time interval + id, you can likely do that with Kafka plus a filter or appropriate partitioning. Kafka lets you start processing at a specific timestamp, and if your data is partitioned by id or you filter it as the first step in your analytics, you could start at the provided timestamp and stop processing when you hit a message outside the time window. This only applies if the timestamp you're interested in is the time the message was added to Kafka, though. I also don't believe Kafka supports below-millisecond resolution on the timestamps it generates.
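A sketch of that replay pattern with the plain Kafka consumer API (topic name and brokers are placeholders, the time window here is simply "the last 24 hours", and with several partitions the upper-bound check is only approximate):

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class ReplayWindow {
    public static void main(String[] args) {
        long windowEndMs = System.currentTimeMillis();
        long windowStartMs = windowEndMs - Duration.ofHours(24).toMillis();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("sensor-data").stream()
                    .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            // Find the first offset at or after windowStartMs in each partition and seek there.
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, windowStartMs));
            for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : consumer.offsetsForTimes(query).entrySet()) {
                if (e.getValue() != null) {
                    consumer.seek(e.getKey(), e.getValue().offset());
                }
            }

            boolean done = false;
            while (!done) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    if (rec.timestamp() >= windowEndMs) {
                        // closed window: stop once we pass the upper bound
                        done = true;
                        break;
                    }
                    // hand rec off to the analytics code here
                }
            }
        }
    }
}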
If you need to do more advanced queries (e.g. you need to look at timestamps generated by your sensors), you could look at using Cassandra or Elasticsearch or Solr as your permanent data store. You will also want to investigate how to get the data from those systems back into your analytics system. For example, I believe Spark ships with a connector for reading from Elasticsearch, while Elasticsearch provides a connector for Storm. You should check whether such a connector exists for your data store/analytics system combination, or be willing to write your own.
Edit: Elaborating to answer your comment.
I was not aware that Kafka or Pulsar supported timestamps specified by the user, but sure enough, they both do. I don't see that Pulsar supports sub-millisecond timestamps though?
The idea you describe can definitely be supported by Kafka.
What you need is the ability to start a Kafka/Pulsar client at a specific timestamp, and read forward. Pulsar doesn't seem to support this yet, but Kafka does.
You need to guarantee that when you write data into a partition, the messages arrive in order of timestamp. This means that you are not allowed to e.g. write message 1 with timestamp 10 first, and then message 2 with timestamp 5.
If you can make sure you write messages in order to Kafka, the example you describe will work. Then you can say "Start at timestamp 'last night at midnight'", and Kafka will start there. As live data comes in, it will receive it and add it to the end of its log. When the consumer/analytics framework has read all the data from last midnight to current time, it will start waiting for new (live) data to arrive, and process it as it comes in. You can then write custom code in your analytics framework to make sure it stops processing when it reaches the first message with timestamp 'tomorrow night'.
With regard to support of sub-millisecond timestamps, I don't think Kafka or Pulsar will support it out of the box, but you can work around it reasonably easily. Just put the sub-millisecond timestamp in the message as a custom field. When you want to start at e.g. timestamp 9ms 10ns, you ask Kafka to start at 9ms, and use a filter in the analytics framework to drop all messages between 9ms and 9ms 10ns.
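A tiny sketch of that workaround, assuming (purely as a convention of this example) that the producer stores the full nanosecond timestamp in a record header named "ts-nanos" as an 8-byte big-endian long:

import java.nio.ByteBuffer;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;

public class NanoFilter {

    // Read the nanosecond timestamp that the producer put into the "ts-nanos"
    // header (a convention of this sketch, not a Kafka feature).
    static long nanosOf(ConsumerRecord<?, ?> rec) {
        Header h = rec.headers().lastHeader("ts-nanos");
        return ByteBuffer.wrap(h.value()).getLong();
    }

    // Keep only records at or after the exact (sub-millisecond) window start;
    // Kafka itself is only asked to seek to the surrounding millisecond.
    static boolean inWindow(ConsumerRecord<?, ?> rec, long windowStartNanos) {
        return nanosOf(rec) >= windowStartNanos;
    }
}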
Allow me to add the following suggestions on how Apache Pulsar might help address some of your requirements. Food for thought as it were.
"data is flowing/streamed into the platform/framework, attached an identifier like a URL or an ID or such"
You might want to look at Pulsar Functions, which allows you to write simple functions (In Java or Python) that gets executed on each individual message that is published to a topic. They are ideal for this type of data augmentation use case.
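A minimal sketch of such a function with the Java SDK; the "stream-id" user-config key and the id + "|" framing are assumptions of this example, not anything Pulsar defines:

import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Runs once per message published to the input topic and returns the message
// with an identifier prepended; the result is published to the output topic
// configured when the function is deployed.
public class TagWithIdFunction implements Function<String, String> {

    @Override
    public String process(String input, Context context) {
        // "stream-id" is a hypothetical user-config key set at deploy time.
        String id = context.getUserConfigValue("stream-id")
                           .map(Object::toString)
                           .orElse("unknown");
        return id + "|" + input;
    }
}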
the platform interacts with integrated or external storage to persist the streaming data (for years), associated with the identifier
Pulsar has recently added tiered storage, which allows you to retain event streams in S3, Azure Blob Storage, or Google Cloud Storage. This would allow you to keep the data for years in a cheap and reliable data store.
analytics processes can now transparently query/analyse data addressed by an identifier and an arbitrary (open or closed) time window, and the framework supplies data batches/samples for the analysis either from backend storage or coming in live from data acquisition
Apache Pulsar has also added integration with the Presto query engine, which would allow you to query the data over a given time period (including data from tiered-storage) and place it into a topic for processing.

Workaround for joining two streams in structured streaming in Spark 2.x

I have a stream of configurations (not changed often, but if there's an update, it will be a message), and another stream of raw data points.
As I understand it, for now Spark doesn't support joining two streaming Datasets or DataFrames. Is there a good way to work around this?
Is it possible to "snapshot" one of the streaming Datasets into a static Dataset (probably the configuration one, since it has fewer updates), then join it with the other streaming Dataset?
Open to suggestions!
"Workaround" is to use current master branch ;)
It's not released yet, but the current master branch already has stream-stream inner join, and outer join is in progress. See this Jira ticket for reference; in the sub-tasks you can see the possible joins to use.
There's no other easy workaround. Streaming joins require saving the state of the streams and then updating that state correctly. You can see the code in the pull requests; it's quite complex to implement a stream-stream join.
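Once you are on a build that includes it (released later as part of Spark 2.3), a stream-stream inner join looks roughly like the sketch below; the rate sources, the column names and the one-hour join window are just placeholders for the config and data-point streams.

import static org.apache.spark.sql.functions.expr;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StreamStreamJoinSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("join-sketch").getOrCreate();

        // Two rate sources stand in for the configuration and data-point streams.
        Dataset<Row> configs = spark.readStream().format("rate").load()
                .withColumnRenamed("value", "configKey")
                .withColumnRenamed("timestamp", "configTime")
                .withWatermark("configTime", "1 hour");

        Dataset<Row> points = spark.readStream().format("rate").load()
                .withColumnRenamed("value", "pointKey")
                .withColumnRenamed("timestamp", "pointTime")
                .withWatermark("pointTime", "10 minutes");

        // Inner stream-stream join with an equality condition plus a time bound.
        Dataset<Row> joined = points.join(configs, expr(
                "pointKey = configKey AND " +
                "pointTime BETWEEN configTime AND configTime + interval 1 hour"));

        joined.writeStream().format("console").start().awaitTermination();
    }
}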
So here is what I'm doing at the end.
Put the stream with fewer updates into a memory sink, then do a select from that table. By this time it is a static instance and can be joined with the other stream. No trigger needed. Of course, you need to update the table correctly yourself.
This is not very robust, but it's the best I can come up with before the official support arrives.
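For reference, a sketch of that workaround (topic names, the in-memory table name and the join key are assumptions; refreshing the snapshot when a configuration update arrives is left to your own code):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MemorySinkJoinSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("memory-sink-join").getOrCreate();

        // The slowly changing configuration stream (placeholder Kafka topic).
        Dataset<Row> configStream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "configs")
                .load();

        // Materialize it into an in-memory table; rows show up as batches complete.
        configStream.writeStream()
                .format("memory")
                .queryName("configs_snapshot")
                .outputMode("append")
                .start();

        // Reading the table back gives a static Dataset that can be joined with a stream.
        Dataset<Row> configSnapshot = spark.sql("SELECT * FROM configs_snapshot");

        Dataset<Row> dataStream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "datapoints")
                .load();

        // Stream-static joins are supported; "key" is assumed to be the shared join column.
        Dataset<Row> joined = dataStream.join(configSnapshot, "key");

        joined.writeStream().format("console").start().awaitTermination();
    }
}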

What is the simplest way to write to Kafka from a Spark stream

I would like to write to Kafka from Spark stream data.
I know that I can use KafkaUtils to read from Kafka.
But KafkaUtils doesn't provide an API to write to Kafka.
I checked a past question and sample code.
Is the above sample code the simplest way to write to Kafka?
If I adopt an approach like that sample, I must create many classes...
Do you know a simpler way, or a library that helps with writing to Kafka?
Have a look here:
Basically, this blog post summarises your possibilities, which are written up in different variations in the link you provided.
If we look at your task straightforwardly, we can make several assumptions:
Your output data is divided into several partitions, which may (and quite often will) reside on different machines
You want to send the messages to Kafka using the standard Kafka Producer API
You don't want to pass data between machines before the actual sending to Kafka
Given those assumptions, your set of solutions is pretty limited: you either have to create a new Kafka producer for each partition and use it to send all the records of that partition, or you can wrap this logic in some sort of factory/sink, but the essential operation remains the same: you'll still request a producer object for each partition and use it to send the partition's records.
I suggest you continue with one of the examples in the provided link; the code is pretty short, and any library you find would most probably do the exact same thing behind the scenes, as in the sketch below.
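A sketch of the producer-per-partition approach described above (the topic name and broker address are placeholders, and the stream is assumed to already contain the strings you want to publish):

import java.util.Iterator;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.spark.streaming.api.java.JavaDStream;

public class KafkaSinkSketch {

    // Creates one producer per partition per batch and sends that partition's records.
    static void writeToKafka(JavaDStream<String> stream) {
        stream.foreachRDD(rdd ->
            rdd.foreachPartition((Iterator<String> records) -> {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");
                props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    while (records.hasNext()) {
                        producer.send(new ProducerRecord<>("output-topic", records.next()));
                    }
                }   // close() flushes any pending sends
            })
        );
    }
}

A common refinement is to cache the producer in a lazily initialized singleton on each executor instead of creating a new one per batch, which is exactly the kind of wrapper a sink library would provide.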

Parallelism of Streams in Spark Streaming Context

I have multiple input sources (~200) coming in on Kafka topics - the data for each is similar, but each must be run separately because there are differences in schemas - and we need to perform aggregate health checks on the feeds (so we can't throw them all into one topic in a simple way without creating more work downstream). I've created a Spark app with a Spark streaming context, and everything seems to be working, except that it is only running the streams sequentially. There are certain bottlenecks in each stream which make this very inefficient, and I would like all streams to run at the same time - is this possible? I haven't been able to find a simple way to do this. I've seen the concurrentJobs parameter, but that didn't work as desired. Any design suggestions are also welcome, if there is not an easy technical solution.
Thanks
The answer was here:
https://spark.apache.org/docs/1.3.1/job-scheduling.html
with the fairscheduler.xml file.
By default it is FIFO... it only worked for me once I explicitly wrote the file (I couldn't set it programmatically for some reason).
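For reference, a sketch of the configuration involved; as noted above, the programmatic route did not take effect for the author and writing fairscheduler.xml explicitly was what worked, so the file path and pool name below are placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class FairSchedulingSetup {
    public static void main(String[] args) {
        // A minimal fairscheduler.xml could look like:
        //   <allocations>
        //     <pool name="streams">
        //       <schedulingMode>FAIR</schedulingMode>
        //       <weight>1</weight>
        //       <minShare>0</minShare>
        //     </pool>
        //   </allocations>
        SparkConf conf = new SparkConf()
                .setAppName("parallel-streams")
                .set("spark.scheduler.mode", "FAIR")
                .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml");

        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Work submitted from the driver can be directed into a specific pool:
        jssc.sparkContext().setLocalProperty("spark.scheduler.pool", "streams");
    }
}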

Resources