In our current architecture we use Hazelcast IMDG to share information about user activity among several server nodes.
Our map has the following structure: [key:String|value: CustomObject].
Now we want to expand our product and build a real-time dashboard on top of a real-time data stream, performing operations such as:
Complex Aggregation
Continuous Query
etc.
At the end of the process we want to “send” the result to a Vert.x Event Bus and then to a socket layer (SockJS), in order to show the data in a dashboard.
We need a fast, scalable system that can handle a heavy load, on the order of thousands of events per second.
The first image represents our current (old) architecture, the second image represents our “target” architecture.
Old Architecture
Target Architecture
What do you think about target architecture?
Is the role of Hazelcast Jet correct, or is there another way to perform these operations (for example, with Hazelcast IMDG only)?
Thanks in advance.
This looks like a good fit for Hazelcast Jet. You will probably use Sources.mapJournal() to process entries as they are added to the IMap. You can aggregate into sliding windows easily, and writing a Vert.x Event Bus sink should be straightforward with SinkBuilder. Thousands of events per second is a low figure; how much headroom you have depends on how much work you do with each event.
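Here is a minimal sketch of what such a job could look like, assuming Jet 4.x, an IMap named "userOps" with its event journal enabled, and a Vert.x Event Bus address "dashboard.updates" that a SockJS bridge listens on (all of these names are made up, and the windowed count stands in for your real aggregation):

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.aggregate.AggregateOperations;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sink;
import com.hazelcast.jet.pipeline.SinkBuilder;
import com.hazelcast.jet.pipeline.Sources;
import com.hazelcast.jet.pipeline.WindowDefinition;
import io.vertx.core.Vertx;

import static com.hazelcast.jet.pipeline.JournalInitialPosition.START_FROM_CURRENT;

public class DashboardJob {

    public static void main(String[] args) {
        JetInstance jet = Jet.newJetInstance();

        // Sink that publishes each window result to a Vert.x Event Bus address.
        // For simplicity this creates a standalone Vertx per sink processor; a real
        // deployment would use a clustered Vertx reachable by the dashboard verticle.
        Sink<Object> vertxSink = SinkBuilder
                .sinkBuilder("vertx-eventbus", ctx -> Vertx.vertx())
                .receiveFn((Vertx vertx, Object item) ->
                        vertx.eventBus().publish("dashboard.updates", item.toString()))
                .destroyFn(Vertx::close)
                .build();

        Pipeline p = Pipeline.create();
        // Requires the event journal to be enabled for the "userOps" map in the IMDG config.
        p.readFrom(Sources.<String, Object>mapJournal("userOps", START_FROM_CURRENT))
         .withIngestionTimestamps()
         // 1-minute sliding window that advances every second
         .window(WindowDefinition.sliding(60_000, 1_000))
         .aggregate(AggregateOperations.counting())
         .writeTo(vertxSink);

        jet.newJob(p).join();
    }
}
```

The SockJS bridge on the Vert.x side then only has to forward whatever arrives on dashboard.updates to the browser dashboard.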
As you know, the Kappa architecture is a kind of simplification of the Lambda architecture. Kappa doesn't need a batch layer; instead, the speed layer has to guarantee computation precision and enough throughput (more parallelism/resources) when re-computing historical data.
Still, the Kappa architecture requires two serving layers when you need to do analytics based on historical data. For example, data less than 2 weeks old is stored in Redis (the streaming serving layer), while all older data is stored somewhere in HBase (the batch serving layer).
When (in the Kappa architecture) do I have to insert data into the batch serving layer?
If the streaming layer inserts data immediately into both the batch and the streaming serving layers, then what about late data arrivals? Or should the streaming layer back up the speed serving layer to the batch serving layer on a regular basis?
Example: let's say the source of data is Kafka, the data is processed by Spark Structured Streaming or Flink, and the sinks are Redis and HBase. When should the writes to Redis and HBase happen?
If we perform stream processing, we want to make sure that output data is first made available as a data stream. In your example that means we write to Kafka as the primary sink.
Now you have two options:
have secondary jobs that read from that Kafka topic and write to Redis and HBase. That is the Kafka way: Kafka Streams does not support writing directly to any of these systems, so you set up a Kafka Connect job. These secondary jobs can then be tailored to the specific sinks, but they add additional operational overhead. (That is a bit like the backup option you mentioned.)
with Spark and Flink you also have the option to have secondary sinks directly in your job. You may add additional processing steps to transform the Kafka output into a more suitable form for the sink, but you are more limited when configuring the job. For example in Flink, you need to use the same checkpointing settings for the Kafka sink and the Redis/HBase sink. Nevertheless, if the settings work out, you just need to run one streaming job instead of 2 or 3.
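For illustration, here is a rough sketch of the second option in Flink: one job that writes its results back to Kafka as the primary sink and additionally to a placeholder secondary sink standing in for Redis/HBase. Topic names, the broker address and the trivial "aggregation" are made up; this is a sketch, not a production job.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.util.Properties;

public class OneJobTwoSinks {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000); // the same checkpointing settings apply to all sinks in the job

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "kafka:9092");
        kafkaProps.setProperty("group.id", "analytics");

        DataStream<String> results = env
                .addSource(new FlinkKafkaConsumer<>("input-events", new SimpleStringSchema(), kafkaProps))
                .map(value -> value.toUpperCase()); // stand-in for the real aggregation logic

        // Primary sink: results go back to Kafka first, as a stream.
        results.addSink(new FlinkKafkaProducer<>("aggregated-results", new SimpleStringSchema(), kafkaProps));

        // Secondary sink in the same job: placeholder for a Redis/HBase writer.
        results.addSink(new SinkFunction<String>() {
            @Override
            public void invoke(String value, Context context) {
                // write 'value' to Redis or HBase here
            }
        });

        env.execute("one-job-two-sinks");
    }
}
```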
Late events
Now the question is what to do with late data. The best solution is to let the framework handle that through watermarks. That is, data is only committed at all sinks, when the framework is sure that no late data arrives. If that doesn't work out because you really need to process late events even if they arrive much, much later and still want to have temporary results, you have to use update events.
Update events
(as requested by the OP, I will add more details to the update events)
In Kafka Streams, elements are emitted through a continuous refinement mechanism by default. That means windowed aggregations emit results as soon as they have any valid data point and update that result as new data arrives. Thus, any late event is processed and yields an updated result. While this approach nicely lowers the burden on users, as they do not need to understand watermarks, it has some severe shortcomings that led the Kafka Streams developers to add Suppression in 2.1 and onward.
The main issue is that it poses quite big challenges for downstream users to process intermediate results, as also explained in the article about Suppression. If it is not obvious whether a result is temporary or "final" (in the sense that all expected events have been processed), then many applications are much harder to implement. In particular, windowing operations need to be replicated on the consumer side to get the "final" value.
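For illustration, a minimal Kafka Streams sketch (assuming 2.1+, with made-up topic names) of a windowed count that uses Suppression so that downstream consumers only ever see the final value per window:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.WindowedSerdes;

import java.time.Duration;

public class SuppressedCounts {

    public static void build(StreamsBuilder builder) {
        builder.stream("events", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // 1-minute windows; accept events up to 10 minutes late
               .windowedBy(TimeWindows.of(Duration.ofMinutes(1)).grace(Duration.ofMinutes(10)))
               .count()
               // without suppress(), every incoming event would emit an updated intermediate count
               .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
               .toStream()
               .to("event-counts", Produced.with(
                       WindowedSerdes.timeWindowedSerdeFrom(String.class), Serdes.Long()));
    }
}
```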
Another issue is that the data volume is blown up. With a strong aggregation factor, watermark-based emission will reduce your data volume heavily after the first operation. However, continuous refinement adds a constant volume factor, as each record triggers a new (intermediate) record for all intermediate steps.
Lastly, and particularly interesting for you is how to offload data to external systems if you have update events. Ideally, you would offload the data with some time lag continuously or periodically. That approach simulates the watermark-based emission again on consumer side.
Mixing the options
It's possible to use watermarks for the initial emission and then use update events for late events. The volume is then reduced for all "on-time" events. For example, Flink offers allowed lateness to make windows trigger again for late events.
This setup makes offloading data much easier, as data only needs to be re-emitted to the external systems if a late event actually happened. The system should be tuned so that a late event is a rare case, though.
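As a sketch of this mixed setup in Flink (the Event type, field names and the trivial aggregation are made up): watermarks drive the first, on-time emission, and allowedLateness re-triggers the window for late events instead of dropping them.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class LateEventWindows {

    /** Hypothetical event: a key, an event-time timestamp (ms) and a value. */
    public static class Event {
        public String key;
        public long timestampMillis;
        public long value;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Event> events = env.fromElements(new Event()); // stand-in for the real Kafka source

        events
            // watermarks tolerate 5 s of out-of-orderness for the "on-time" emission
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                 .withTimestampAssigner((e, ts) -> e.timestampMillis))
            .keyBy(e -> e.key)
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            // events arriving up to 10 minutes after the watermark re-trigger the window
            // and emit an updated (late) result instead of being dropped
            .allowedLateness(Time.minutes(10))
            .sum("value")
            .print();

        env.execute("late-event-windows");
    }
}
```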
I'm still quite new to the world of stream and batch processing and trying to understand the concepts and terminology. It is admittedly quite possible that the answer to my question is well known, easy to find, or even answered a hundred times here on SO, but I was not able to find it.
The background:
I am working on a big scientific project (nuclear fusion research), and we produce tons of measurement data during experiment runs. That data consists mostly of streams of samples tagged with a nanosecond timestamp, where a sample can be anything from a single ADC value, through an array of such values, through deeply structured data (with up to hundreds of entries ranging from 1-bit booleans to 64-bit double-precision floats), to raw HD video frames or even plain text messages. If I understand the common terminology correctly, I would regard our data as "tabular data", for the most part.
We mostly work with home-grown software solutions, from data acquisition through simple online (streaming) analysis (such as scaling, subsampling and the like) to our own data storage, management and access facilities.
In view of the scale of the operation and the effort for maintaining all those implementations, we are investigating the possibilities to use standard frameworks and tools for more of our tasks.
My question:
In particular at this stage, we are facing the need for increasingly sophisticated (automated and manual) data analytics on live/online/real-time data, as well as "after the fact" offline/batch analytics of "historic" data. In this endeavor, I am trying to understand if and how existing analytics frameworks like Spark, Flink, Storm etc. (possibly supported by message queues like Kafka, Pulsar, ...) can support a scenario where
data is flowing/streamed into the platform/framework, attached an identifier like a URL or an ID or such
the platform interacts with integrated or external storage to persist the streaming data (for years), associated with the identifier
analytics processes can now transparently query/analyse data addressed by an identifier and an arbitrary (open or closed) time window, and the framework supplies data batches/samples for the analysis either from backend storage or coming in live from data acquisition
Simply streaming the online data into storage and querying from there seems no option as we need both raw and analysed data for live monitoring and realtime feedback control of the experiment.
Also, making the user query a live input signal and a historic batch from storage in different ways would not be ideal, as our physicists are mostly not data scientists; we would like to keep such "technicalities" away from them, and ideally the exact same algorithms should be used for analysing new real-time data and old stored data from previous experiments.
Side notes:
we are talking about peak data loads in the range of tens of gigabits per second, coming in bursts of increasing length, from seconds up to minutes - could the candidates handle this?
we are using timestamps with nanosecond resolution, and are even thinking about picoseconds - this poses some limitations on the list of possible candidates, if I understand correctly?
I would be very grateful if anyone could make sense of my question and shed some light on the topic for me :-)
Many Thanks and kind regards,
Beppo
I don't think anyone can say "yes, framework X can definitely handle your workload", because it depends a lot on what you need out of your message processing, e.g. regarding messaging reliability, and how your data streams can be partitioned.
You may be interested in Benchmarking Distributed Stream Processing Engines. The paper uses versions of Storm/Flink/Spark that are a few years old (it looks like they were released in 2016), but maybe the authors would be willing to let you use their benchmark to evaluate newer versions of the three frameworks?
A very common setup for streaming analytics is to go data source -> Kafka/Pulsar -> analytics framework -> long term data store. This decouples processing from data ingest, and lets you do stuff like reprocessing historical data as if it were new.
I think the first step for you should be to see if you can get the data volume you need through Kafka/Pulsar. Either generate a test set manually, or grab some data you think could be representative from your production environment, and see if you can put it through Kafka/Pulsar at the throughput/latency you need.
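As a very rough sketch of such a smoke test with the plain Kafka producer API (broker address, topic name, record size and count are all made up; Pulsar has an analogous producer API):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class IngestSmokeTest {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");      // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put("linger.ms", "5");                        // allow a little batching for throughput
        props.put("acks", "1");

        byte[] sample = new byte[4096];                     // stand-in for one measurement sample
        int records = 1_000_000;

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            long start = System.nanoTime();
            for (int i = 0; i < records; i++) {
                // key = channel/sensor id, so one sensor always lands in the same partition
                byte[] key = ("sensor-" + (i % 64)).getBytes();
                producer.send(new ProducerRecord<>("ingest-test", key, sample));
            }
            producer.flush();
            double seconds = (System.nanoTime() - start) / 1e9;
            System.out.printf("%.0f records/s, %.1f MB/s%n",
                    records / seconds, records * sample.length / seconds / 1e6);
        }
    }
}
```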
Remember to consider partitioning of your data. If some of your data streams could be processed independently (i.e. ordering doesn't matter), you should not be putting them in the same partitions. For example, there is probably no reason to mix sensor measurements and the video feed streams. If you can separate your data into independent streams, you are less likely to run into bottlenecks both in Kafka/Pulsar and the analytics framework. Separate data streams would also allow you to parallelize processing in the analytics framework much better, as you could run e.g. video feed and sensor processing on different machines.
Once you know whether you can get enough throughput through Kafka/Pulsar, you should write a small example for each of the 3 frameworks. To start, I would just receive and drop the data from Kafka/Pulsar, which should let you know early whether there's a bottleneck in the Kafka/Pulsar -> analytics path. After that, you can extend the example to do something interesting with the example data, e.g. do a bit of processing like what you might want to do in production.
You also need to consider which kinds of processing guarantees you need for your data streams. Generally you will pay a performance penalty for guaranteeing at-least-once or exactly-once processing. For some types of data (e.g. the video feed), it might be okay to occasionally lose messages. Once you decide on a needed guarantee, you can configure the analytics frameworks appropriately (e.g. disable acking in Storm), and try benchmarking on your test data.
Just to answer some of your questions more explicitly:
The live data analysis/monitoring use case sounds like it fits the Storm/Flink systems fairly well. Hooking it up to Kafka/Pulsar directly, and then doing whatever analytics you need sounds like it could work for you.
Reprocessing of historical data is going to depend on what kind of queries you need to do. If you simply need a time interval + id, you can likely do that with Kafka plus a filter or appropriate partitioning. Kafka lets you start processing at a specific timestamp, and if your data is partitioned by id or you filter it as the first step in your analytics, you could start at the provided timestamp and stop processing when you hit a message outside the time window. This only applies if the timestamp you're interested in is when the message was added to Kafka, though. I also don't believe Kafka supports below-millisecond resolution on the timestamps it generates.
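For illustration, a sketch of the "start at a timestamp" approach with the plain Kafka consumer API. Topic name, broker address and the window bounds are made up, and a real job would track the end of the window per partition instead of stopping at the first out-of-window record:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class TimeWindowReplay {

    public static void main(String[] args) {
        long windowStartMs = 1_700_000_000_000L;   // assumed start of the query window
        long windowEndMs   = windowStartMs + 3_600_000L;

        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "replay-" + System.currentTimeMillis());
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("measurements").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);

            // Translate the start timestamp into an offset per partition and seek there.
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, windowStartMs));
            for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : consumer.offsetsForTimes(query).entrySet()) {
                if (e.getValue() != null) {
                    consumer.seek(e.getKey(), e.getValue().offset());
                }
            }

            boolean done = false;
            while (!done) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, byte[]> record : records) {
                    if (record.timestamp() >= windowEndMs) {
                        done = true;          // reached the end of the requested window
                        break;
                    }
                    // hand the record off to the analytics step here
                }
            }
        }
    }
}
```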
If you need to do more advanced queries (e.g. you need to look at timestamps generated by your sensors), you could look at using Cassandra or Elasticsearch or Solr as your permanent data store. You will also want to investigate how to get the data from those systems back into your analytics system. For example, I believe Spark ships with a connector for reading from Elasticsearch, while Elasticsearch provides a connector for Storm. You should check whether such a connector exists for your data store/analytics system combination, or be willing to write your own.
Edit: Elaborating to answer your comment.
I was not aware that Kafka or Pulsar supported timestamps specified by the user, but sure enough, they both do. I don't see that Pulsar supports sub-millisecond timestamps though?
The idea you describe can definitely be supported by Kafka.
What you need is the ability to start a Kafka/Pulsar client at a specific timestamp, and read forward. Pulsar doesn't seem to support this yet, but Kafka does.
You need to guarantee that when you write data into a partition, the messages arrive in order of timestamp. This means that you are not allowed to, for example, first write message 1 with timestamp 10 and then message 2 with timestamp 5.
If you can make sure you write messages in order to Kafka, the example you describe will work. Then you can say "Start at timestamp 'last night at midnight'", and Kafka will start there. As live data comes in, it will receive it and add it to the end of its log. When the consumer/analytics framework has read all the data from last midnight to current time, it will start waiting for new (live) data to arrive, and process it as it comes in. You can then write custom code in your analytics framework to make sure it stops processing when it reaches the first message with timestamp 'tomorrow night'.
With regard to support of sub-millisecond timestamps, I don't think Kafka or Pulsar will support it out of the box, but you can work around it reasonably easily. Just put the sub-millisecond timestamp in the message as a custom field. When you want to start at e.g. timestamp 9ms 10ns, you ask Kafka to start at 9ms, and use a filter in the analytics framework to drop all messages between 9ms and 9ms 10ns.
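Sketch of that workaround, with the nanosecond timestamp carried as a hypothetical field of the message payload; Kafka is only asked to seek to the containing millisecond, and the finer cut happens in the analytics code:

```java
// Hypothetical record payload carrying the full-resolution timestamp as a custom field.
class Sample {
    long timestampNanos;   // written by the data-acquisition side
    byte[] payload;
}

class NanoFilter {
    /** Drop everything outside the requested nanosecond window. */
    static boolean isInWindow(Sample s, long startNanos, long endNanos) {
        return s.timestampNanos >= startNanos && s.timestampNanos < endNanos;
    }
}
```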
Allow me to add the following suggestions on how Apache Pulsar might help address some of your requirements. Food for thought as it were.
"data is flowing/streamed into the platform/framework, attached an identifier like a URL or an ID or such"
You might want to look at Pulsar Functions, which let you write simple functions (in Java or Python) that get executed on each individual message published to a topic. They are ideal for this type of data augmentation use case.
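A minimal sketch of such a function using the Pulsar Functions Java SDK; the "enrichment" of wrapping the message with the input topic name as an identifier is just an illustration:

```java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

/**
 * Runs once per message published to the input topic and returns the message
 * tagged with a (hypothetical) stream identifier.
 */
public class TagWithIdFunction implements Function<String, String> {

    @Override
    public String process(String input, Context context) {
        // Derive an identifier; here we simply reuse the input topic name as the "stream id".
        String streamId = context.getInputTopics().iterator().next();
        return "{\"id\":\"" + streamId + "\",\"data\":" + input + "}";
    }
}
```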
the platform interacts with integrated or external storage to persist the streaming data (for years), associated with the identifier
Pulsar has recently added tiered storage, which allows you to retain event streams in S3, Azure Blob Store, or Google Cloud Storage. This lets you keep the data for years in a cheap and reliable data store.
analytics processes can now transparently query/analyse data addressed by an identifier and an arbitrary (open or closed) time window, and the framework supplies data batches/samples for the analysis either from backend storage or coming in live from data acquisition
Apache Pulsar has also added integration with the Presto query engine, which would allow you to query the data over a given time period (including data from tiered-storage) and place it into a topic for processing.
I am looking for a good, up-to-date and "decision-helping" explanation of how to choose a NoSQL database engine for storing all the events in a CQRS-designed application.
I am currently a newcomer to all things NoSQL (but learning): please be clear and do not hesitate to explain your point of view in an (almost overly) precise manner. This post may serve other newcomers like me.
This database will:
Be able to insert 2 to 10 rows per update requested by the front view (in my case, updates are frequent). Think of thousands of updates per minute; how would it scale?
Critically need to be consistent and failure safe, since events are the source of truth of the application
Not need any link between entities (like RDBMS does) except maybe a user ID/GUID (I don't know if it's critical or needed yet)
Receive events containing 3 to 10 "columns" (a sequence ID, an event name, a datetime, a JSON/binary-encoded parameter bag, some context information, ...). This is not meant to steer you toward a column-oriented type of database; it may be document-oriented if it fits all the other requirements
Be used as a queue, or be sent to/read from an external AMQP system like RabbitMQ or ZeroMQ (I haven't worked on that part yet; feel free to argue/explain that choice too), since view projections will be built upon the events
Need some kind of filtering by sequence ID like SELECT * FROM events WHERE sequence_id > last_sequence_id for subscribers (or queue systems) to be able to synchronize from a given point
I have heard of HBase for CQRS event storage, but maybe MongoDB could fit? Or even Elasticsearch (I would not bet on that one...)? I'm also open to an RDBMS for consistency and availability... but what about the partition-tolerance part?
I'm really lost; I need arguments to make a pertinent choice.
https://geteventstore.com/ is a database designed specifically for event streams.
They take consistency and reliability of the source of truth (your events) very seriously and I use it myself to read/write thousands of events a second.
I have a working, in-production implementation of MongoDB as an event store. It is used by a CQRS + event-sourcing web-based CRM application.
In order to provide a 100% transaction-less but transaction-like guarantee for persisting multiple events in one go (all events or none of them), I use a MongoDB document as an event commit, with the events as nested documents. As you know, MongoDB has document-level locking.
For concurrency I use optimistic locking, with a version property on each Aggregate stream. An Aggregate stream is identified by the pair (Aggregate class x Aggregate ID).
The event store also stores the commits in relative order using a sequence on each commit, incremented on each commit, protected using optimistic locking.
Each commit contains the following:
aggregateId : string, probably a GUID,
aggregateClass: string,
version: integer, incremented for each aggregateId x aggregateClass,
sequence, integer, incremented for each commit,
createdAt: UTCDateTime,
authenticatedUserId: string or null,
events: list of EventWithMetadata,
Each EventWithMetadata contains the event class/type and the payload as string (the serialized version of the actual event).
The MongoDB collection has the following indexes:
aggregateId, aggregateClass, version as unique
events.eventClass, sequence
sequence
other indexes for query optimization
These indexes are used to enforce the general event store rules (no events are stored for the same version of an Aggregate) and for query optimizations (the client can select only certain events - by type - from all streams).
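For illustration, a sketch of that layout with the MongoDB Java driver: the unique compound index is what enforces the optimistic locking, and a single insertOne per commit gives the all-or-nothing guarantee (collection name and field values are illustrative, not from the production system):

```java
import com.mongodb.MongoWriteException;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

import java.util.Arrays;
import java.util.Date;

public class MongoEventStoreSetup {

    public static void main(String[] args) {
        MongoCollection<Document> commits = MongoClients.create("mongodb://localhost")
                .getDatabase("eventstore")
                .getCollection("commits");

        // Unique index enforcing optimistic locking: only one commit per
        // (aggregateId, aggregateClass, version) may ever exist.
        commits.createIndex(Indexes.ascending("aggregateId", "aggregateClass", "version"),
                new IndexOptions().unique(true));
        commits.createIndex(Indexes.ascending("events.eventClass", "sequence"));
        commits.createIndex(Indexes.ascending("sequence"));

        Document commit = new Document("aggregateId", "some-guid")          // illustrative values
                .append("aggregateClass", "Customer")
                .append("version", 1)
                .append("sequence", 42L)
                .append("createdAt", new Date())
                .append("authenticatedUserId", null)
                .append("events", Arrays.asList(
                        new Document("eventClass", "CustomerRegistered")
                                .append("payload", "{serialized event}")));
        try {
            commits.insertOne(commit);          // atomic: all nested events or none
        } catch (MongoWriteException e) {
            // duplicate key => another writer won the optimistic-locking race; reload and retry
        }
    }
}
```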
You could use sharding by aggregateId to scale if you strip the global ordering of events (the sequence property) and move that responsibility to an event publisher, but this complicates things, as the event publisher needs to stay synchronized (even in case of failure!) with the event store. I recommend doing it only if you need it.
Benchmarks for this implementation (on an Intel i7 with 8 GB of RAM):
total aggregate write time was: 7.99, speed: 12516 events written per second
total aggregate read time was: 1.43, speed: 35036 events read per second
total read-model read time was: 3.26, speed: 30679 events read per second
I've noticed that MongoDB was slow on counting the number of events in the event store. I don't know why but I don't care as I don't need this feature.
I recommend using MongoDB as an event store.
I have a .NET Core event sourcing implementation project: https://github.com/jacqueskang/EventSourcing
I started with relational databases (SQL Server and MySQL) using Entity Framework Core.
Then I moved to AWS, so I wrote a DynamoDB extension.
My experience is that a relational DB can do the job perfectly well, but it depends on your requirements and your technical stack. If your project is cloud based, then the best option is probably your cloud provider's NoSQL database, such as AWS DynamoDB or Azure Cosmos DB, which offer strong performance and provide additional features (e.g. DynamoDB can trigger a notification or a Lambda function).
I am working on a project to migrate my repository to Hazelcast.
I need to find documents by date range, store type and store IDs.
During my tests I got 90k throughput using one c3.large instance, but when I run the same test with more instances the throughput does not scale accordingly (10 instances: 500k, 20 instances: 700k).
These numbers were the best I could get by tuning some properties:
hazelcast.query.predicate.parallel.evaluation
hazelcast.operation.generic.thread.count
hz:query
I have tried changing the instance type to c3.2xlarge to get more processing power, but the numbers don't justify the price.
How can I optimize Hazelcast to be faster in this scenario?
My use case doesn't use map.get(key), only map.values(predicate).
Settings:
Hazelcast 3.7.1
Map as Data Structure;
Complex object using IdentifiedDataSerializable;
Map index configured;
Only 2000 documents on map;
Hazelcast embedded configured by Spring Boot Application (singleton);
All instances in same region.
Test
Gatling
New Relic as service monitor.
Any help is welcome. Thanks.
If your use case only uses map.values with a predicate, I would strongly suggest using OBJECT as the in-memory format. This way, there will not be any serialization involved during query execution.
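For illustration, a sketch of a Hazelcast 3.7-style programmatic configuration with the OBJECT in-memory format and indexes, plus a values(predicate) query. The map name and attribute names are made up; adapt them to your document object:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.InMemoryFormat;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.MapIndexConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.query.Predicate;
import com.hazelcast.query.Predicates;

public class DocumentMapConfig {

    public static void main(String[] args) {
        MapConfig mapConfig = new MapConfig("documents")
                .setInMemoryFormat(InMemoryFormat.OBJECT)                    // no per-query deserialization
                .addMapIndexConfig(new MapIndexConfig("storeType", false))   // unordered index for equality
                .addMapIndexConfig(new MapIndexConfig("date", true));        // ordered index for range queries

        Config config = new Config().addMapConfig(mapConfig);
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);

        IMap<String, Object> documents = hz.getMap("documents");
        Predicate query = Predicates.and(
                Predicates.equal("storeType", "WAREHOUSE"),
                Predicates.between("date", 20230101, 20230131));
        documents.values(query);
    }
}
```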
On the other hand, it is normal to get very high numbers when you only have one member, because no data is moving across the network. To improve, I would look at EC2 instances with higher network capacity. For example, c3.8xlarge has a 10 Gbit network, compared to the "High" network-performance rating that comes with c3.2xlarge.
I can't promise how much of an increase you will get, but I would definitely try these changes first.
Starting out on a new project and looking for advice on a suitable platform. Current thinking is between Hazelcast and AppScale, given that our team’s combined (but limited) experience covers an older version of Hazelcast and GAE. Both can also apparently be set up on EC2, which may be the easiest way to meet the CPU demand we expect.
Problem Profile
1). Our data consists of many small records stored by date (but not always time). Some are small numerical records (business stats, something like daily weather info or stock market prices) and some are bulky text (log file entries). Data volumes are not huge, in the region of hundreds per day at between 1k and 50k each.
2). A very, very large number of instances of computationally expensive numerical models (think Monte Carlo sims) operate constantly over fixed-size windows of the same data.
3). A number of monitoring agents make data available.
4). Larger (longer periods of time) sets of the same data to be processed offline once daily.
With Hazelcast we would add incoming data to maps and use the Executor service to run models over the shared data, roughly as in the sketch below. We would likely use Tomcat to provide minimal front-end access to the grid as required.
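For illustration, a rough sketch of that pattern; the map and class names are made up, and the task must be serializable so it can be shipped to the grid:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastInstanceAware;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.IMap;

import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

public class GridSketch {

    /** Hypothetical model run: reads a fixed-size window of records from the shared map. */
    public static class MonteCarloTask implements Callable<Double>, Serializable, HazelcastInstanceAware {
        private final String windowKeyPrefix;
        private transient HazelcastInstance hz;

        public MonteCarloTask(String windowKeyPrefix) {
            this.windowKeyPrefix = windowKeyPrefix;
        }

        @Override
        public void setHazelcastInstance(HazelcastInstance hz) {
            this.hz = hz;   // injected on the member that executes the task
        }

        @Override
        public Double call() {
            IMap<String, double[]> records = hz.getMap("daily-records");
            // run the expensive simulation over the window identified by windowKeyPrefix
            return 0.0;
        }
    }

    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        IMap<String, double[]> records = hz.getMap("daily-records");
        records.put("2020-01-15/stat-42", new double[]{1.0, 2.0, 3.0});   // monitoring agents feed the map

        IExecutorService models = hz.getExecutorService("model-runner");
        Future<Double> result = models.submit(new MonteCarloTask("2020-01-15"));
        System.out.println("model output: " + result.get());
    }
}
```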
With AppScale we would add tables per data-type and use the Task Queues API to frame the numerical models. Servlets deployed to AppScale as per GAE to provide front end.
Question
Should we use AppScale or Hazelcast for requirements like this? That is - for the problem as stated, are there any stand-out factors for/against either platform that we should consider?
If you prefer/require a distributed, service-oriented programming model (bag of tasks) then the answer is AppScale. If you prefer/require a parallel programming model (single machine abstraction) then the answer is Hazelcast. AppScale is also a complete cloud platform (vs only a datastore) which enables you to do more things with your app as it evolves. If you go with AppScale, you can adjust the timing restriction on the tasks and customize the platform with the libraries you want to use, for your computationally expensive methods.