How can I buffer the twitter stream with redis before inserting into rethinkdb? - node.js

Where I'm At
I have a simple node.js twitter stream consumer that tracks various hashtags. Oftentimes, these are trending hashtags, which means a high-volume of twitter json is streaming into my consumer. I don't do any processing of the twitter json in the consumer.
What I Want
I want to store the tweet json objects in rethinkdb.
Assumptions
Due to the volume (and unpredictability of said volume) of tweets, I should avoid inserting the tweet json objects into rethinkdb as they are consumed (since the rate at which the tweets enter the consumer might be faster than the rate at which rethinkdb can write those tweets).
Since Redis is definitely fast enough to handle the writes of the tweet json objects as they are consumed, I can push the tweet json objects directly to redis and have another process pull those tweets out and insert them into rethinkdb.
What I Hope To Learn
Are my assumptions correct?
Does this architecture make sense? If not, can you suggest a better alternative?
If my assumptions are correct and this architecture makes sense,
a. What is the best way of using redis as a buffer for the tweets?
b. What is the best way of reading from (and updating/clearing) the redis buffer in order to perform the inserts into rethinkdb?

We do use this kind of architecture in our production. If the amount of data that you are going to handle doesn't exceeds the max memory limit of redis you can proceed this way. And also you need to take care of downtime.
What is the best way of using redis as a buffer for the tweets?
You can use a redis queue. Where you producer keeps pushing into the head.
And your consumer consumes from the tail and populates to your db.
http://redis.io/commands#list
You can use this solution Redis Pop list item By numbers of items as you have a similar requirement (producer is heavy and consumer needs to consume little quicker than popping one by one)

Related

Transparent Streaming & Batch processing

I'm still quite new to the world of stream and batch processing and trying to understnad concepts and speach. It is admitedly very possible that the answer to my question well known, easy to find or even answered a hundred times here at SO, but I was not able to find it.
The background:
I am working in a big scientific project (nuclear fusion research), and we are producing tons of measurement data during experiment runs. Those data are mostly streams of samples tagged with a nanosecond timestamp, where samples can be anything from a single by ADC value, via an array of such, via deeply structured data (with up to hundreds of entries from 1 bit booleans to 64bit double precision floats) to raw HD video frames or even string text messages. If I understand the common terminologies right, I would regard our data as "tabular data", for the most part.
We are working with mostly selfmade software solutions from data acquisition over simple online (streaming) analysis (like scaling, subsampling and such) to our own data sotrage, management and access facilities.
In view of the scale of the operation and the effort for maintaining all those implementations, we are investigating the possibilities to use standard frameworks and tools for more of our tasks.
My question:
In particular at this stage, we are facing the need for more and more sofisticated (automated and manual) data analytics on live/online/realtime data as well as "after the fact" offline/batch analytics of "historic" data. In this endavor, I am trying to understand if and how existing analytics frameworks like Spark, Flink, Storm etc. (possibly supported by message queues like Kafka, Pulsar,...) can support a scenario, where
data is flowing/streamed into the platform/framework, attached an identifier like a URL or an ID or such
the platform interacts with integrated or external storage to persist the streaming data (for years), associated with the identifier
analytics processes can now transparently query/analyse data addressed by an identifier and an arbitrary (open or closed) time window, and the framework suplies data batches/samples for the analysis either from backend storage or coming in live from data acquisition
Simply streaming the online data into storage and querying from there seems no option as we need both raw and analysed data for live monitoring and realtime feedback control of the experiment.
Also, letting the user query either a live input signal or a historic batch from storage differently would not be ideal, as our physicists mostly are no data scientists and we would like to keep such "technicalities" away from them and idealy the exact same algorithms should be used for analysing new real time data and old stored data from previous experiments.
Sitenotes:
we are talking about peek data loads in the range of 10th of gigabits per second coming in bursts of increasing length of seconds up to minutes - could this be handled by the candidates?
we are using timestamps in nanosecond resolution, even thinking about pico - this poses some limitations on the list of possible candidates if I unserstand correctly?
I would be very greatfull if anyone would be able to understand my question and to shed some light on the topic for me :-)
Many Thanks and kind regards,
Beppo
I don't think anyone can say "yes, framework X can definitely handle your workload", because it depends a lot on what you need out of your message processing, e.g. regarding messaging reliability, and how your data streams can be partitioned.
You may be interested in BenchmarkingDistributedStreamProcessingEngines. The paper is using versions of Storm/Flink/Spark that are a few years old (looks like they were released in 2016), but maybe the authors would be willing to let you use their benchmark to evaluate newer versions of the three frameworks?
A very common setup for streaming analytics is to go data source -> Kafka/Pulsar -> analytics framework -> long term data store. This decouples processing from data ingest, and lets you do stuff like reprocessing historical data as if it were new.
I think the first step for you should be to see if you can get the data volume you need through Kafka/Pulsar. Either generate a test set manually, or grab some data you think could be representative from your production environment, and see if you can put it through Kafka/Pulsar at the throughput/latency you need.
Remember to consider partitioning of your data. If some of your data streams could be processed independently (i.e. ordering doesn't matter), you should not be putting them in the same partitions. For example, there is probably no reason to mix sensor measurements and the video feed streams. If you can separate your data into independent streams, you are less likely to run into bottlenecks both in Kafka/Pulsar and the analytics framework. Separate data streams would also allow you to parallelize processing in the analytics framework much better, as you could run e.g. video feed and sensor processing on different machines.
Once you know whether you can get enough throughput through Kafka/Pulsar, you should write a small example for each of the 3 frameworks. To start, I would just receive and drop the data from Kafka/Pulsar, which should let you know early whether there's a bottleneck in the Kafka/Pulsar -> analytics path. After that, you can extend the example to do something interesting with the example data, e.g. do a bit of processing like what you might want to do in production.
You also need to consider which kinds of processing guarantees you need for your data streams. Generally you will pay a performance penalty for guaranteeing at-least-once or exactly-once processing. For some types of data (e.g. the video feed), it might be okay to occasionally lose messages. Once you decide on a needed guarantee, you can configure the analytics frameworks appropriately (e.g. disable acking in Storm), and try benchmarking on your test data.
Just to answer some of your questions more explicitly:
The live data analysis/monitoring use case sounds like it fits the Storm/Flink systems fairly well. Hooking it up to Kafka/Pulsar directly, and then doing whatever analytics you need sounds like it could work for you.
Reprocessing of historical data is going to depend on what kind of queries you need to do. If you simply need a time interval + id, you can likely do that with Kafka plus a filter or appropriate partitioning. Kafka lets you start processing at a specific timestamp, and if you data is partitioned by id or you filter it as the first step in your analytics, you could start at the provided timestamp and stop processing when you hit a message outside the time window. This only applies if the timestamp you're interested in is when the message was added to Kafka though. I also don't believe Kafka supports below-millisecond resolution on the timestamps it generates.
If you need to do more advanced queries (e.g. you need to look at timestamps generated by your sensors), you could look at using Cassandra or Elasticsearch or Solr as your permanent data store. You will also want to investigate how to get the data from those systems back into your analytics system. For example, I believe Spark ships with a connector for reading from Elasticsearch, while Elasticsearch provides a connector for Storm. You should check whether such a connector exists for your data store/analytics system combination, or be willing to write your own.
Edit: Elaborating to answer your comment.
I was not aware that Kafka or Pulsar supported timestamps specified by the user, but sure enough, they both do. I don't see that Pulsar supports sub-millisecond timestamps though?
The idea you describe can definitely be supported by Kafka.
What you need is the ability to start a Kafka/Pulsar client at a specific timestamp, and read forward. Pulsar doesn't seem to support this yet, but Kafka does.
You need to guarantee that when you write data into a partition, they arrive in order of timestamp. This means that you are not allowed to e.g. write first message 1 with timestamp 10, and then message 2 with timestamp 5.
If you can make sure you write messages in order to Kafka, the example you describe will work. Then you can say "Start at timestamp 'last night at midnight'", and Kafka will start there. As live data comes in, it will receive it and add it to the end of its log. When the consumer/analytics framework has read all the data from last midnight to current time, it will start waiting for new (live) data to arrive, and process it as it comes in. You can then write custom code in your analytics framework to make sure it stops processing when it reaches the first message with timestamp 'tomorrow night'.
With regard to support of sub-millisecond timestamps, I don't think Kafka or Pulsar will support it out of the box, but you can work around it reasonably easily. Just put the sub-millisecond timestamp in the message as a custom field. When you want to start at e.g. timestamp 9ms 10ns, you ask Kafka to start at 9ms, and use a filter in the analytics framework to drop all messages between 9ms and 9ms 10ns.
Allow me to add the following suggestions on how Apache Pulsar might help address some of your requirements. Food for thought as it were.
"data is flowing/streamed into the platform/framework, attached an identifier like a URL or an ID or such"
You might want to look at Pulsar Functions, which allows you to write simple functions (In Java or Python) that gets executed on each individual message that is published to a topic. They are ideal for this type of data augmentation use case.
the platform interacts with integrated or external storage to persist the streaming data (for years), associated with the identifier
Pulsar has recently added tiered-storage, that allows you to retain event streams in S3, Azure Blob Store, or Google Cloud storage. This would allow you to keep the data for years in a cheap and reliable data store
analytics processes can now transparently query/analyse data addressed by an identifier and an arbitrary (open or closed) time window, and the framework suplies data batches/samples for the analysis either from backend storage or coming in live from data acquisition
Apache Pulsar has also added integration with the Presto query engine, which would allow you to query the data over a given time period (including data from tiered-storage) and place it into a topic for processing.

Learning Spark Streaming

I am learning spark streaming using the book "Learning spark Streaming". In the book i found the following on a section talking about Dstream, RDD, block/partition.
Finally, one important point that is glossed over in this schema is that the Receiver interface also has the option of connecting to a data source that delivers a collection (think Array) of data pieces. This is particularly relevant in some de-serialization uses, for example. In this case, the Receiver does not go through a block interval wait to deal with the segmentation of data into partitions, but instead considers the whole collection reflects the segmentation of the data into blocks, and creates one block for each element of the collection. This operation is demanding on the part of the Producer of data, since it requires it to be producing blocks at the ratio of the block interval to batch interval to function reliably (delivering the correct number of blocks on every batch). But some have found it can provide superior performance, provided an implementation that is able to quickly make many blocks available for serialization.
I have been banging my head around and can't simply understand what the Author is talking about, although i feel like i should understand it. Can someone give me some pointers on that ?
Disclosure: I'm co-author of the book.
What we want to express there is that the custom receiver API has 2 working modes: one where the producing side delivers one-message-at-time and the other where the receiver may deliver many messages at once (bulk).
In the one-message-at-time mode, Spark is responsible of buffering and collecting the data into blocks for further processing.
In the bulk mode, the burden of buffering and grouping is on the producing side, but it might be more efficient in some scenarios.
This is reflected in the API:
def store(dataBuffer: ArrayBuffer[T]): Unit
Store an ArrayBuffer of received data as a data block into Spark's memory.
def store(dataItem: T): Unit
Store a single item of received data to Spark's memory.
I agree with you that the paragraph is convoluted and might not convey the message as clear as we would like. I'll take care of improving it.
Thanks for your feedback!

Handle duplicates in batch POST requests to a REST API

The stack
Express.js API server for CRUD operations over data.
MongoDB database.
Moongose interface for MongoDB for schemas.
The probem
In order to handle duplicates in just one point, I want to do it in the only possible entry point: The API.
Definition: duplicate
A duplicate is an entity which already exists in the data base, so the
new POST request is the same entity with exact the same data, or it is
the same entity with updated data.
The API design is meant to handle the new http2 protocol.
Bulk importers have been written. This programs get the data from a given source, transform the data to our specific format, and make POST request to save it. This importers are designed to handle every entity in parallel.
The API already has a duplication handler which works great when a given entity already exists in the database. The problem comes when the bulk importers make several POST requests for the same entity at the same time, and the entity doesn't exist in the database yet.
....POST/1 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
......POST/2 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
........POST/3 .databaseCheck.......DataBaseResult=false..........DatabaseWrite
.....................POST/N .databaseCheck.......DataBaseResult=false..........DatabaseWrite
This situation produces the creation of the same entity several times, because the database checks haven't finished when the rest of the POST requests arrive.
Only if the number of POST requests is big enough, the first write operation would have already finished, and the databaseCheck of the Nth request will return true.
What would be the correct solution for handle this?
If I'm not wrong, what I'm looking for has the name of transaction, and I don't know if this is something that the database should offer by default, or if it is something that I have to implement.
Solutions I have already considered:
1. Limit the requests, just one each time.
This is the simplest solution, but if the API remains blocked when the bulk importers make several requests, then the frontend client would get very slow, and it is meant to be fast, and multiplayer. So this, in fact, is not a solution.
2. Special bulk API endpoint for each entity.
If an application needs to make bulk requests, then make just one huge POST request with all the data as body request.
This solution doesn't block the API, and can handle duplicates very well, but what I don't like is that I would go against the http2 protocol, where many and small request are desired.
And the problem persists and other future clients may have this problem if they don't notice that there is available a bulk endpoint. But maybe this is not a problem.
3. Try to use the possible MongoDB transaction implementation
I've read a little bit about this, but I don't know if it would be possible to handle this problem with the MongoDB and Mongoose tools. I've done some search, but I haven't find anything, because before to try to insert many documents, I need to generate the data for each document, and that data is coming inside each POST request.
4. Drop MongoDB and use a transaction friendly database.
This would have a big cost at this point because the whole stack is already finished, and we are near to launch. We aren't afraid of refactor. But I think here would apply the 3rd solution considerations.
5. Own transactions implementation at the API level?
I've designed a solution that may work for every cases, and that I call the pool stream.
This is the design:
When a POST request arrives, a timer of a fixed amount of milliseconds starts. That amount of time would be big enough to catch several requests, and small enough in order to do not cause a noticeable delay.
Inside each chunk of requests, the data is processed trying to merge duplicates before writing in the database. So if inside a chunk n requests have been catch, n - m (where m <= n) unique candidates are generated. A hash function is applied to each candidate in order to assign the hash result to each request-response pair. Then the write operation to the database of the candidates is done in parallel, and the current duplicates handler would work for this at the write time.
When the writes for the current chunk finish, the response is sent to each request-response pair of the chunk, then the next chunk is processed. While a chunk is in the queue waiting for the write operation, could be doing the unique candidates process, in order to accelerate the whole process.
What do you think?
Thank you.

Using Spark to process requests

I would like to understand if the following would be a correct use case for Spark.
Requests to an application are received either on a message queue, or in a file which contains a batch of requests. For the message queue, there are currently about 100 requests per second, although this could increase. Some files just contain a few requests, but more often there are hundreds or even many thousands.
Processing for each request includes filtering of requests, validation, looking up reference data, and calculations. Some calculations reference a Rules engine. Once these are completed, a new message is sent to a downstream system.
We would like to use Spark to distribute the processing across multiple nodes to gain scalability, resilience and performance.
I am envisaging that it would work like this:
Load a batch of requests into Spark as as RDD (requests received on the message queue might use Spark Streaming).
Separate Scala functions would be written for filtering, validation, reference data lookup and data calculation.
The first function would be passed to the RDD, and would return a new RDD.
The next function would then be run against the RDD output by the previous function.
Once all functions have completed, a for loop comprehension would be run against the final RDD to send each modified request to a downstream system.
Does the above sound correct, or would this not be the right way to use Spark?
Thanks
We have done something similar working on a small IOT project. we tested receiving and processing around 50K mqtt messages per second on 3 nodes and it was a breeze. Our processing included parsing of each JSON message, some manipulation of the object created and saving of all the records to a time series database.
We defined the batch time for 1 second, the processing time was around 300ms and RAM ~100sKB.
A few concerns with streaming. Make sure your downstream system is asynchronous so you wont get into memory issue. Its True that spark supports back pressure, but you will need to make it happen. another thing, try to keep the state to minimal. more specifically, your should not keep any state that grows linearly as your input grows. this is extremely important for your system scalability.
what impressed me the most is how easy you can scale with spark. with each node we added we grew linearly in the frequency of messages we could handle.
I hope this helps a little.
Good luck

Cassandra for messaging

I am now doing a small research to find a way to store huge volume of data (temporarily, till some consumers consume these messages) from various 'message producers' (source). The data come from different sources, say HTTP, FTP, SMPP or file upload, each type may have tens or hundreds of instances creating messages. The messages produced by them can grow so huge that the message consumers may lag behind in consuming the messages as the processes may take long or not short time. Now, the system uses RabbitMQ in some parts, but its performance drops when huge volume of unconsumed message grows (I'm also looking into improving its performance, but that's different). As an alternate, I am looking on to Apache Kafka which uses the disk for persisting messages.
As I read through many articles in the internet, I read some articles that talks about the Apache Cassandra with very fast write, processing a million inserts per second and similar volume reads. I was astonished, and tried to find some leads in using Cassandra for my case but with no clear results.
Assuming I have large number of message producers, can Cassandra (cluster) handle inserts (in batches) so faster (overall high throughput) that the producers does not throttle?
I am sure some among you could have used Cassandra for this or similar kind of use cases, share you experiences. (I am ready to provide you any more information if this does not suffice)
Yes! Cassandra can handle writes very effectively. But in my experience, using it as a messaging system (queue and the likes) brings some technical constraints because of the tombstones.
Cassandra doesn't remove deleted rows immediately and marks them with a tomstone to be garbarge collected later. Overtime, if there are a lot of deletions (eg. dequeue messages), the overall performance will be hurt, and quite quickly.
You can go for Cassandra but you will have to implement work around for the tomstone problem (time bucket, multiple status tables).
IMHO, Apache Kafka is much more appropriate to the messaging use case and can also be scaled massively.

Resources