3 threads per request - multithreading

I have a Spring Boot app that will be used on a fairly popular e-commerce platform. I need to create 3 threads that run Cassandra queries in parallel with some business logic to make the service performant. Is this unheard of? I have barely used threads in my young career.

That's right. I usually create batches to query Cassandra, and use a thread pool to run the Cassandra queries in parallel so the service stays performant.
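For concreteness, here is a minimal sketch of the thread-pool approach, assuming the DataStax Java driver 4.x; the keyspace, table, and query names are invented for the example. Three independent queries for one request are submitted to a small shared pool and joined before the business logic runs:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;

public class ParallelQueries {

    // One small pool shared by all requests; don't create a new pool per request.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(3);

    public static void main(String[] args) throws Exception {
        try (CqlSession session = CqlSession.builder().build()) {
            // Submit the three independent queries; they run concurrently.
            Future<ResultSet> orders  = POOL.submit(() -> session.execute("SELECT * FROM shop.orders   WHERE user_id = 42"));
            Future<ResultSet> profile = POOL.submit(() -> session.execute("SELECT * FROM shop.profiles WHERE user_id = 42"));
            Future<ResultSet> cart    = POOL.submit(() -> session.execute("SELECT * FROM shop.carts    WHERE user_id = 42"));

            // Wait for all three, with a timeout so one slow query can't hang the request.
            ResultSet o = orders.get(2, TimeUnit.SECONDS);
            ResultSet p = profile.get(2, TimeUnit.SECONDS);
            ResultSet c = cart.get(2, TimeUnit.SECONDS);
            // ... business logic combining o, p, and c ...
        } finally {
            POOL.shutdown(); // in a real service, shut down on application stop instead
        }
    }
}
```

In a Spring Boot service you would typically keep the pool (or a `ThreadPoolTaskExecutor` bean) alive for the application's lifetime rather than per request; the driver's `executeAsync`, which returns a `CompletionStage`, is an alternative that avoids blocking pool threads at all.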

Related

What is the industry standard for number of clusters for a development team in Databricks?

I am part of a team of 5 developers that gathers, transforms, analyzes, and builds predictions on data in Azure Databricks (basically a combination of Data Science and Data Engineering).
Up until now we have been working on relatively small data, so the team of 5 could easily share a single cluster with 8 worker nodes in development. Even though we are 5 developers, there are usually at most 3 of us in Databricks at the same time.
Recently we started working with "Big Data" and thus need to make use of Databricks' Apache Spark parallelization to improve our run-times. However, a problem that quickly came to light is that with more than one developer running parallelized code on a single cluster, jobs queue up and slow us down. Because of this we have been thinking about increasing the number of clusters in our dev environment so that multiple developers can work simultaneously on code that uses Spark's parallelization.
My question is this: what is the industry standard for the number of clusters in a development environment? Do teams usually have a cluster per developer? That sounds like it could easily become quite expensive.
Usually I see the following pattern:
There is a shared cluster that many people use for ad hoc experimenting, "small" data processing, etc. Note that current versions of the Databricks runtimes try to split cluster resources fairly between all users.
If some people need to run something "heavyweight" that is closer to production workloads, such as integration tests, they are allowed to create their own clusters. But to control costs it's recommended to use cluster policies to limit the size of the clusters they can create, the allowed node types, auto-termination times, etc.
For development clusters it's OK to use spot instances, because the Databricks cluster manager will pull in new instances if existing ones are evicted.
SQL queries may run more efficiently on SQL warehouses, which are optimized for BI workloads.
P.S. Really, integration tests and similar things can easily be run as jobs, which are less expensive.

Apache Flink - Run same job multiple times for multi-tenant applications

We have a multi-tenant application where we maintain a message queue for each tenant. We have implemented a Flink job to process the stream data from the message queues; basically, each of the message queues is a source in the Flink job. Is this the recommended way to do it? Or is it OK to run the same job (with one source) multiple times, based on the number of tenants? We expect that each tenant will produce data at a different volume. Will there be any scalability advantages to the multi-job approach?
Approaches:
1. Single job with multiple sources
2. Run duplicates of the same job, each with one source
I think these approaches apply equally to Storm, Spark, or any other streaming platform.
Thank you
Performance-wise, approach 1 has the greatest potential: resources are better utilized across the different sources. Since the sources are different, though, the optimization potential of the query itself is limited.
However, if we are really talking multi-tenant, I'd go with the second approach. You can assign much more fine-grained rights to each application (e.g., which Kafka topic it can consume, which S3 bucket it can write to). Since most application developers tend to develop GDPR-compliant workflows (even if their own countries are not yet affected), I'd go this route to stay on the safe side. This approach also has the advantage that you don't need to restart the job for everyone when you add or remove a tenant.
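As a rough illustration of approach 1, here is a sketch of a single Flink job that builds one Kafka source per tenant and unions them into a shared pipeline. It assumes the `flink-connector-kafka` artifact; the tenant names, topic naming scheme, and processing step are invented for the example:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MultiTenantJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // One source per tenant queue, all merged into one stream.
        DataStream<String> merged = null;
        for (String tenant : new String[]{"tenant-a", "tenant-b", "tenant-c"}) {
            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("kafka:9092")
                    .setTopics(tenant + "-events")
                    .setGroupId("processor-" + tenant)
                    .setStartingOffsets(OffsetsInitializer.latest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();
            DataStream<String> stream =
                    env.fromSource(source, WatermarkStrategy.noWatermarks(), tenant);
            merged = (merged == null) ? stream : merged.union(stream);
        }

        // Shared business logic over all tenants (placeholder transformation).
        merged.map(String::toUpperCase).print();

        env.execute("multi-tenant-job");
    }
}
```

Approach 2 would instead take the topic name as a job parameter and submit the same jar once per tenant, which is exactly what makes the per-tenant permissions and independent restarts described above possible.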

Distributing scheduled tasks across multi-datacenter environment in Node.js with Cassandra

We are attempting to build a system that gets a list of tasks to execute from a Cassandra database and then, through some kind of group consensus, creates an execution plan (preferably on one node) which is then agreed on and executed by the entire cluster of servers. We really do not want to add any additional pieces of software such as Redis or an AMQP system; rather, we want the consensus built directly into all of the servers running the jobs. So far we have found Skiff, an implementation of the Raft algorithm, which looks like it could accomplish the task, but I was wondering if anyone has found an elegant solution to this problem in a pure Node.js way, not involving external messaging systems.
Cassandra supports lightweight transactions, which are basically a Paxos implementation that offers linearizable consistency and a compare-and-set operation, i.e., consensus. So you can use Cassandra itself to serialize the execution plan.
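As an illustration, a lightweight transaction boils down to a conditional write whose `wasApplied()` result tells you whether you won the Paxos round. Here is a hedged sketch of using that to elect the node that builds the plan; it is shown with the DataStax Java driver for consistency with the rest of this page, but the same CQL works from the Node.js driver, and the `scheduler.plan_owner` table is invented:

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;

public class PlanLeaderElection {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Only one node's INSERT is applied for a given plan_id;
            // the TTL lets the claim expire if the winner dies mid-plan.
            ResultSet rs = session.execute(
                "INSERT INTO scheduler.plan_owner (plan_id, owner) " +
                "VALUES ('2024-06-01', 'node-7') IF NOT EXISTS USING TTL 300");

            if (rs.wasApplied()) {
                // This node won the round: build and publish the execution plan.
            } else {
                // Another node owns the plan: read it and execute the agreed steps.
            }
        }
    }
}
```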

Scala and Node.js

We chose Node.js for our web project, but there are many computational tasks for which we would prefer Scala. We are highly concerned about speed: what is the best way to call a Scala "worker" from Node.js in an asynchronous, non-blocking way?
When queuing jobs, it's best to have some kind of broker, like a message queue or a job queue. Redis is a popular choice, as it can also be used for caching and storing data in memory; RabbitMQ is another common choice. The nice thing about having a broker is that it can hold a job until a worker pulls it off the queue whenever it has resources available. A broker also acts as a load balancer of sorts: it holds the jobs, and you can have multiple worker nodes grabbing them, allowing for high availability, scalability, and parallel processing.
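To make the broker idea concrete, here is a minimal sketch of the JVM side using the Jedis client (plain Java for consistency with the rest of this page, though the worker could just as well be Scala). The idea is that the Node.js process pushes a job payload onto a Redis list with RPUSH, and the worker block-pops it; the "jobs" list name and localhost address are assumptions:

```java
import java.util.List;
import redis.clients.jedis.Jedis;

public class Worker {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            while (true) {
                // BLPOP blocks until a job arrives; it returns [key, value].
                List<String> popped = jedis.blpop(0, "jobs");
                String payload = popped.get(1);
                // ... run the computational task, then e.g. RPUSH the result
                // onto a reply list that the Node.js side is listening on ...
                System.out.println("processing " + payload);
            }
        }
    }
}
```

The blocking pop is what gives you the "pull work whenever resources are free" behavior described above; a per-job reply list or pub/sub channel is one common way to get results back to Node.js asynchronously.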
You probably should not be so concerned about speed; in my experience concerns like readability and maintainability are more important in almost all projects.
For short-lived "remote procedure calls" of at most a few seconds, I would tend to use Apache Thrift, which has libraries for JavaScript and the JVM (Scrooge is an alternative Scala implementation, oriented towards writing async backends using Twitter's Finagle futures library), allowing non-blocking calls. By using Thrift you get strongly typed interface definitions that are engineered for forward compatibility, so you know exactly what changes you can make to the interface without breaking compatibility.
Alternatively, one could use an ordinary HTTP ("REST") interface; Node is oriented towards making async HTTP calls, and libraries like Spray make it easy to offer a high-performance, async HTTP interface in Scala.
For longer-running "batch" tasks where you're less concerned about latency and more about reliability, it's probably better to use a dedicated task queue as #tsutrzl suggests.

Nodejs to utilize all cores on all CPUs

I'm going to create a multithreaded application that heavily utilizes all cores on all CPUs, doing some intensive IO (web crawling) and then intensive CPU work (analysis of the crawled streams). Is Node.js good for that, given that it's single-threaded and I don't want to run a couple of Node.js instances (one per core) and sync between them? Or should I consider some other platform?
Node is perfect for that; it is actually named Node as a reference to the intended topology of its apps: multiple (distributed) nodes that communicate with each other.
Take a look at the built-in cluster module, which handles multi-instance applications by forking worker processes that can share server ports.
Further reading
Multi Core NodeJS App, is it possible in a single thread framework? by Cristian Ramirez on Codeburst
Scaling NodeJS Applications by Samer Buna on FreeCodeCamp
JavaScript's V8 engine was made to run async tasks on one core. However, that doesn't mean you can't have multiple cores running the same, or perhaps different, applications that communicate with each other.
You just have to be aware of some multi-core problems that might occur.
For example, if you are going to share LOTS of information between workers, then perhaps this is not the best language for you.
On the multi-core front, I was recently introduced to Elixir, which is based on Erlang (http://elixir-lang.org/).
It is a really cool language, designed 100% around multi-process applications, and it makes it easy to write very fast applications that scale across as many cores as you want or have.
Back to Node: the answer is yes, it can use multiple cores, but it's up to you to decide how to proceed. Take a look at this answer, and it might clarify your mind: Node.js on multi-core machines
