I am learning and evaluating Spark and Flink before picking one of them for a project.
In my evaluation I came up with the following simple task, which I can't figure out how to implement in either framework.
Let's say that:
1. I have a stream of events that are simply notifications that some item has changed somewhere in a database.
2. For each of those events, I need to query the DB to get the new version of the item.
3. I apply some transformation.
4. I connect to another DB and write the result.
My question is as follows:
Using Flink or Spark, how can one make sure that the calls to the DBs are handled asynchronously, to avoid thread starvation?
I come from Scala/Akka, where we typically avoid making blocking calls and use futures all the way down in this kind of situation. Akka Streams allows that fine-grained level of control over stream processing, for instance when integrating a stream with an external service. This avoids thread starvation: while I wait on my I/O operation, the thread can be used for something else.
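For instance, with Akka Streams I would write something like the following (a minimal sketch; fetchItem is a made-up stand-in for the real non-blocking DB call, and it assumes Akka 2.6+, where the ActorSystem itself provides the stream materializer):

    import akka.actor.ActorSystem
    import akka.stream.scaladsl.{Sink, Source}
    import scala.concurrent.Future

    object MapAsyncExample extends App {
      implicit val system: ActorSystem = ActorSystem("example")
      import system.dispatcher

      // Hypothetical non-blocking lookup standing in for the real DB query.
      def fetchItem(id: Int): Future[String] = Future(s"item-$id")

      // mapAsync keeps up to 4 futures in flight and emits results in order;
      // no thread ever sits blocked waiting on I/O.
      Source(1 to 100)
        .mapAsync(parallelism = 4)(fetchItem)
        .runWith(Sink.foreach(println))
        .onComplete(_ => system.terminate())
    }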
In short, I don't see how to work with futures in either framework. Still, I believe this can somehow be reproduced in both.
Can anyone please explain how this is supposed to be handled in Flink or Spark? If it is not supported out of the box, does anyone have experience incorporating it somehow?
Since version 1.2.0 of Flink, you can use the Async I/O API to achieve exactly this.
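A minimal sketch with the Scala API (the names shifted slightly after 1.2.0, where the callback type was still called AsyncCollector; DbClient here is a made-up stand-in for whatever non-blocking driver you actually use):

    import java.util.concurrent.TimeUnit
    import scala.concurrent.{ExecutionContext, Future}
    import scala.util.{Failure, Success}
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.scala.async.{AsyncFunction, ResultFuture}

    // Stand-in for a real non-blocking DB client.
    class DbClient(implicit ec: ExecutionContext) {
      def fetchItem(id: String): Future[String] = Future(s"item-$id")
    }

    class FetchItem extends AsyncFunction[String, String] {
      implicit lazy val ec: ExecutionContext = ExecutionContext.global
      lazy val client = new DbClient

      override def asyncInvoke(id: String, resultFuture: ResultFuture[String]): Unit =
        client.fetchItem(id).onComplete {
          case Success(item) => resultFuture.complete(Iterable(item))
          case Failure(e)    => resultFuture.completeExceptionally(e)
        }
    }

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val ids: DataStream[String] = env.socketTextStream("localhost", 9999)
    // At most 100 requests in flight per task, each with a 1 s timeout.
    val items = AsyncDataStream.unorderedWait(ids, new FetchItem, 1000, TimeUnit.MILLISECONDS, 100)

The operator never blocks a task thread: it hands each element to the callback and moves on, and the capacity parameter bounds how many requests are outstanding at once. Use orderedWait instead if downstream operators need the original event order.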
Related
I'm trying to build a Rust web service using Actix that needs to query an HBase backend. We have chosen to use the Thrift code generator to generate the APIs with this file. However, we are having some trouble figuring out how to pass the connection to our web-tier query functions. The official Actix way is to use an extractor that extracts the application state, in this case an HBase connection. More specifically, we are able to create objects of type THBaseServiceSyncClient, which is the application data we wish to pass around that keeps an open connection to HBase and allows us to do queries.
The official way is to clone this data for each running thread. The first issue we encountered is that this type does not implement the Clone trait. We were able to implement our own Clone, only to realize that it also does not implement the DerefMut trait. This is harder, and cannot be circumvented due to the function definitions in the API linked above. The usual way around it is to wrap the object in a Mutex. We experimented with that, and it performed very poorly: the contention was far too high, and we simply cannot have one global connection for all the threads to use.
We researched how connections to other popular databases are handled in Rust and realized that a connection pool is usually used: a pool of active connections is kept, and a manager keeps track of the alive/dead connections and spins up more as needed. We found the r2d2 crate, which claims to provide a generic connection pool for Rust. Unfortunately, there is no Thrift support, and we experimented by implementing our own very simple pool manager, similar to the mysql variant here. The result was very underwhelming: the throughput was not nearly what we need, and a lot of the time is wasted in the pool manager, according to some simple flamegraph profiling.
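Conceptually, the pattern we are after is simple (a language-neutral sketch, written here in Scala rather than Rust, with illustrative names only):

    import java.util.concurrent.ArrayBlockingQueue

    // A fixed set of connections is handed out and returned, so threads
    // contend on the pool for microseconds instead of serializing every
    // query through one shared, mutex-guarded connection.
    class Pool[C](size: Int, connect: () => C) {
      private val idle = new ArrayBlockingQueue[C](size)
      (1 to size).foreach(_ => idle.put(connect()))

      def withConnection[A](f: C => A): A = {
        val c = idle.take()            // blocks only while all connections are busy
        try f(c) finally idle.put(c)   // always return the connection for reuse
      }
    }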
Is there some more obvious way to achieve this that we're missing? I'm wondering if anyone has run into similar issues and can provide some insight as to the best way to go about this. Much appreciated.
I am writing a payroll management web application in Node.js for my organisation. In many cases the application will involve CPU-intensive mathematical calculations to produce the figures, with many users trying to do this simultaneously.
If I write the logic plainly (setting aside the fact that I have already done my best, from an algorithm and data structure point of view, to contain the complexity), it will run synchronously, blocking the event loop and slowing down requests and responses.
How do I resolve this scenario? What are the possible options for doing this asynchronously? I should also mention that the calculation can be left to run in the background, and I can later notify the user about its status. I have searched all over for a solution and found some, but only in theory; I haven't tested them by implementing them. They are listed below:
Cluster the Node server.
Use worker threads
Use an alternate server and do some load balancing.
Use a message queue and couple it with worker threads to do background tasks.
Can someone offer some tried and battle-tested advice on this scenario, and some associated tutorial links?
You might want to try web workers; they are easy to use and well documented:
https://developer.mozilla.org/en-US/docs/Web/API/Web_Workers_API/Using_web_workers
I have a Node.js app with a small set of users that is currently architected with a single web process. I'm thinking about adding an after save trigger that will get called when a record is added to one of my tables. When that after save trigger is executed, I want to perform a large number of IO operations to external APIs. The number of IO operations depends on the number of elements in an array column on the record. Thus, I could be performing a large number of asynchronous operations after each record is saved in this particular table.
I thought about moving this work to a background job as suggested in Worker Dynos, Background Jobs and Queueing. The article gives a rule of thumb that tasks taking longer than 500 ms should be moved to a background job. However, after working through the example using RabbitMQ (Asynchronous Web-Worker Model Using RabbitMQ in Node), I'm not convinced that it's worth the time to set everything up.
So, my questions are:
For an app with a limited number of concurrent users, is it OK to leave a long-running function in a web process?
If I eventually decide to send this work to a background job it doesn't seem like it would be that hard to change my after save trigger. Am I missing something?
Is there a way to do this that is easier than implementing a message queue?
For an app with a limited number of concurrent users, is it OK to leave a long-running function in a web process?
This is more a question of preference than anything.
In general I say no, it's not OK... but that's based on my experience building RabbitMQ services that run in Heroku workers, and not seeing this as a difficult thing to do.
With a little practice you may find that this is the simpler solution, as I have: it allows simpler and more robust code, as it splits the web tier away from the background processor, allowing each to run without knowing about the other directly.
If I eventually decide to send this work to a background job it doesn't seem like it would be that hard to change my after save trigger. Am I missing something?
Are you missing something? Not really.
As long as you write your current in-the-web-process code in a well-structured and modular fashion, moving it to a background process is usually not a big deal.
Most of the panic people feel about having to move code into the background comes from having their code tightly coupled to the HTTP request/response cycle (I know from personal experience how painful that can be).
Is there a way to do this that is easier than implementing a message queue?
There are many options for distributed computing and background processing. I personally like RabbitMQ and the messaging patterns it uses, and I would suggest giving it a try to see whether it works well for you.
Other options include Redis with pub/sub libraries on top of it, making direct HTTP API calls to another web server, or simply having a timer in your background process check database tables at a given frequency and run code based on the data it finds.
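The web/worker split itself is the same in any language. Here is a rough sketch in Scala with RabbitMQ's Java client, just to show the shape (the queue name and payload are made up):

    import com.rabbitmq.client.{CancelCallback, ConnectionFactory, DeliverCallback}

    val factory = new ConnectionFactory()
    factory.setHost("localhost")
    val channel = factory.newConnection().createChannel()
    channel.queueDeclare("jobs", true, false, false, null)

    // Web process: publish the job and return the HTTP response immediately.
    channel.basicPublish("", "jobs", null, """{"recordId": 42}""".getBytes("UTF-8"))

    // Worker process: consume jobs and make the slow external API calls here.
    val onDeliver: DeliverCallback = (_, msg) =>
      println(s"processing ${new String(msg.getBody, "UTF-8")}")
    val onCancel: CancelCallback = _ => ()
    channel.basicConsume("jobs", true, onDeliver, onCancel)

The point is that the web process only ever does a fast publish; everything slow happens in a process that can crash, restart, or scale without touching the web tier.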
P.S. You may find my RabbitMQ For Developers course of interest if you want to dig deeper into RMQ with Node: http://rabbitmq4devs.com
I have been reading about Clojure for some time and I'm considering it as a replacement for Node.js (which I have used for another project). The most promising library seems to be Aleph/Lamina, which unfortunately doesn't have nearly as many examples as Node. My questions are:
How can I process requests with a chain of async operations, such as reading a document from MongoDB, doing some calculations, saving the new document, and sending it in the response? I was not able to work it out from the examples on the Lamina wiki page. It seems like a pretty common use case, and I was surprised not to find any code showing it. It would be great if you could show me some sample code.
Is this setup adequate for a heavy-load server (say, tens of thousands of requests per second)? I can't afford to create one thread per request, so I need something similar to the Node approach.
Is there any example of a medium- or large-sized company out there using any of this?
Is there any better Clojure replacement for Node (besides Aleph/Lamina)? Perhaps ClojureScript targeting Node? My client is not written in JavaScript, so using the same language on both client and server is not an advantage in my case.
Thanks!
A few pointers:
You need to look at Aleph, which builds HTTP abstractions on top of Lamina's channel abstraction.
Reading and writing documents to MongoDB can be async, but the library has to support this. In Node.js the MongoDB library has to be async, otherwise it would break the Node programming model, whereas this is not the case with Clojure, so most probably the Clojure MongoDB libraries provide non-async functions.
Async operations are only helpful for I/O, i.e. reading from/writing to MongoDB, sending the response back, etc. General computations are CPU-bound and have nothing to do with the async model.
Vert.x is the Java world's Node.js. Clojure support is on its roadmap. I would prefer Aleph, as you can move between the async and non-async worlds as required.
I am learning the Play 2.0 framework for Scala, and aside from being able to process requests, I would like to run a continuous task in the background, like a bunch of timers, and somehow be able to access those timers from the request/response actions without running into thread synchronization problems. I have heard of Jobs in Play, and there are Actors in Scala. However, I cannot find any info on Jobs in 2.0; they seem to have been replaced by Promises. But none of this really amounts to running a persistent background thread, and I am not sure how Actors fit into the whole paradigm.
Basically, my question is: what is the traditional way to get this kind of persistence in Play 2.0?
Not quite right: Jobs have not been replaced by Promises, but by scheduling messages to be sent to actors (see "Scheduling asynchronous tasks").
Anyway, actors seem to be the way to go for you. Play 2.0 uses Akka for that, and it's quite simple, actually. The Akka home page has a detailed explanation of what actors are and what you can do with them, but you can think of an actor as some code (say, a function) with a mailbox. You can send messages to the mailbox, and the function will be run for each message waiting in it. This could be just a periodic signal for a recurring job, or a reference for a long background task telling it what it needs to update.
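A minimal sketch of that setup (the names are made up; in Play 2.0 you would grab the actor system via play.api.libs.concurrent.Akka.system rather than creating your own):

    import scala.concurrent.duration._
    import akka.actor.{Actor, ActorSystem, Props}

    case object Tick      // recurring signal
    case object GetCount  // what a request handler would ask for

    class TimerActor extends Actor {
      private var ticks = 0            // safe: only this actor touches it
      def receive = {
        case Tick     => ticks += 1            // the recurring background work
        case GetCount => sender() ! ticks      // answer requests, no locks needed
      }
    }

    val system = ActorSystem("app")
    val timer  = system.actorOf(Props[TimerActor], "timer")

    import system.dispatcher           // ExecutionContext for the scheduler
    system.scheduler.schedule(0.seconds, 1.second, timer, Tick)

From a controller action you would then use the ask pattern (akka.pattern.ask) to get the current value back as a Future, so there is no shared mutable state between your requests and the background task.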