Distributing scheduled tasks across a multi-datacenter environment in Node.js with Cassandra

We are attempting to build a system that gets a list of tasks to execute from a Cassandra database and then, through some kind of group consensus, creates an execution plan (preferably on one node) which is then agreed on and executed by the entire cluster of servers. We really do not want to add any additional pieces of software such as Redis or an AMQP system; we would rather have the consensus built directly into all of the servers running the jobs. So far we have found Skiff, an implementation of the Raft algorithm, which looks like it could accomplish the task, but I was wondering if anyone has found an elegant solution to this problem in a pure Node.js way, not involving external messaging systems.

Cassandra supports lightweight transactions, which are basically a Paxos implementation offering linearizable consistency and a compare-and-set (CAS) operation (consensus). So you can use Cassandra itself to serialize the execution plan.
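To make that concrete, here is a minimal sketch using the DataStax cassandra-driver package for Node.js. The keyspace, the execution_plans table, and its columns are hypothetical; the key point is the IF NOT EXISTS clause, the Paxos-backed compare-and-set that lets exactly one node win the right to publish the plan for a given scheduling window.

```js
// Minimal sketch (hypothetical schema): one node claims the right to publish
// the execution plan for a scheduling window via a Cassandra lightweight
// transaction (IF NOT EXISTS).
const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['10.0.0.1', '10.0.0.2'],
  localDataCenter: 'dc1',
  keyspace: 'scheduler'
});

async function tryPublishPlan(windowId, nodeId, plan) {
  // IF NOT EXISTS is the Paxos-backed compare-and-set: only the first INSERT
  // for this windowId is applied; every other node gets "not applied".
  const query = 'INSERT INTO execution_plans (window_id, planner, plan) ' +
                'VALUES (?, ?, ?) IF NOT EXISTS';
  const result = await client.execute(query, [windowId, nodeId, JSON.stringify(plan)], {
    prepare: true,
    serialConsistency: cassandra.types.consistencies.serial
  });
  return result.wasApplied(); // true only on the node whose plan won
}

// Every node computes a candidate plan and calls tryPublishPlan(); the losers
// then read the winning plan back and execute their share of it.
```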

Related

Which tools to use when migrating bounded data?

I recently started working on a content repository migration project between two different content management systems.
We have around 11 petabytes of documents in a source repository. We want to migrate all of them, one document at a time, by querying the source system's API and saving through the destination system's API.
We will have a single standalone machine for this migration and should be able to manage (start, stop, resume) the whole process.
What platforms and tools would you suggest for such a task? Is Flink's DataSet API for bounded data suitable for this job?
Flink's DataStream API is probably a better choice than the DataSet API because the streaming API can be stopped/resumed and can recover from failures. By contrast, the DataSet API reruns failed jobs from the beginning, which isn't a good fit for a job that might run for days (or weeks).
While Flink's streaming API is designed for unbounded data streams, it also works very well for bounded datasets.
If the underlying CMSes can support doing the migration in parallel, Flink would easily accommodate this. The Async I/O feature would be helpful in that context. But if you are going to do the migration serially, then I'm not sure you'll get much benefit from a framework like Flink or Spark.
Basically what David said above. The main challenge I think you'll run into is tracking progress such that checkpointing/savepointing (and thus restarting) works properly.
This assumes you have some reasonably efficient and stable way to enumerate the unique IDs for all 1B documents in the source system. One approach we've used in a previous migration project (though not with Flink) was to use the document creation timestamp as the "event time".

Hazelcast Jet - Use Cases

What are the use-cases of Hazelcast Jet? Has anyone started using it?
Our project uses Hazelcast for a distributed map holding key-value pairs and for distributed computing on those keys, running each task on the node that holds the key. We use the NearCache solution as well.
I was curious to know how Hazelcast Jet is different and what problems it solves.
As of the current version (0.3), Jet's advantage over just submitting a Runnable to each partition is the ability to perform grouping by a key other than the one used in the Hazelcast map. For this to work in a distributed environment, you have to send each item to the processing unit responsible for its grouping key, and this is something that is easy to get from Jet.
Further from that, you can build a multistage cascade of groupBy operations, you can have forks in your data stream to reuse the same intermediate result in more than one way, you can build a pipeline where an I/O task distributes the processing of the data it reads across all CPU cores, etc... in short, all the advantages that a full-blown DAG computation engine offers.
By the time it reaches 1.0 Jet will also support fault-tolerant infinite stream processing, event time-based windows, and more.
2021 answer for use cases:
Change data capture streaming - Use Debezium/Hazelcast to detect changes to your database and stream to other microservices (if data is common), stream changes to a data lake, or update a search engine
Real-time analytics - Take a market data stream and perform technical analysis in real time, or do Twitter analysis
Async job processing - PDF conversion service

Scaling Nodejs server to multiple systems?

I want to build chat servers in Node.js using Express. I have used the cluster module for scaling the server across multiple cores, but how do I scale up to different systems?
Since Node.js does not support shared memory, distributing Node.js processes across multiple machines provides for the same experience as using a cluster to distribute processes across multiple cores—if your application can run as multiple independent processes within a single system, then it can also be distributed to run as multiple independent processes across multiple systems.
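For reference, a minimal sketch of the single-machine starting point the question describes (cluster module plus Express; the port and route are arbitrary). Each worker is an ordinary independent process, which is why the same worker code can run unchanged on other machines behind a load balancer.

```js
// Minimal sketch: one worker process per CPU core, each running its own
// Express server. Requires Node 16+ for cluster.isPrimary (use isMaster on
// older versions).
const cluster = require('cluster');
const os = require('os');

if (cluster.isPrimary) {
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
  cluster.on('exit', () => cluster.fork()); // replace crashed workers
} else {
  const express = require('express');
  const app = express();
  app.get('/health', (req, res) => res.send(`ok from pid ${process.pid}`));
  app.listen(3000);
}
```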
Great, so that's one less thing to worry about! Now, there are many infrastructure solutions out there that would abstract running clusters on several systems, but your application is otherwise oblivious to any one you might pick.
What will concern you, though, within the realm of your application and any single process, is discovering external services, communicating with processes across the infrastructure, and communicating with processes within a cluster. Again, there are many solutions out there that will cater to any particular requirement your application needs to address.
So far, the Node.js community has favored simple approaches that are highly specialized for solving a particular problem and then get out of your way. For instance:
Web socket clients and servers: low latency within a cluster; also works well across the whole network when you can just send some data and get on with your life, but it will bring things down to a crawl if you need to synchronize processes, such as sending some data, waiting and idling until a result eventually comes back
Redis: clusters are easy to set up, instances handle discovery on their own, there are enough atomic operations to provide a solid approach to sharing data among different instances, and the pub/sub support provides for low-latency IPC (see the sketch after this list)
ZMQ: lauded for its intelligent, highly available connections; you can devise any messaging protocol with a few dozen lines of code that the next human being maintaining your application will be able to reason about
etcd: distributed, consistent key-value store; low infrastructural overhead, allows for implementing straightforward service discovery on top that will integrate nicely with every infrastructure solution out there
Consul: based on Serf; like etcd, but strongly opinionated, it provides service discovery on steroids with many additional niceties. If you like managing things on your own and have the time to invest up front, I would heartily recommend further investigation
While this certainly doesn't cover all the options available, it should be enough to get you going in the right direction. With just these simple building blocks that are ridiculously easy to reason about, you should be able to distribute your application across several systems, running across several machines in several datacenters.
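To make the Redis item concrete for the chat-server case, here is a minimal sketch using the redis npm package (v4 client): every server instance publishes incoming chat messages to a channel and subscribes to the same channel, so clients connected to other machines still receive them. The channel name, message shape, and broadcastToLocalClients callback are hypothetical.

```js
// Minimal sketch: cross-machine chat fan-out over Redis pub/sub. Every Node.js
// instance publishes locally received messages and subscribes to the same
// channel, so clients connected to other machines still get them.
const { createClient } = require('redis');

async function setupFanout(redisUrl, broadcastToLocalClients) {
  const pub = createClient({ url: redisUrl });
  const sub = pub.duplicate(); // pub/sub needs its own connection
  await pub.connect();
  await sub.connect();

  // Deliver messages published by any instance to this instance's clients.
  await sub.subscribe('chat:messages', (raw) => {
    broadcastToLocalClients(JSON.parse(raw));
  });

  // Call this whenever a locally connected client sends a message.
  return async function publishMessage(msg) {
    await pub.publish('chat:messages', JSON.stringify(msg));
  };
}
```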
If you're using a process manager like PM2, it will take care of starting up your Node app on the same or different machines, but to handle multiple machines you should look into Puppet, Chef, or Ansible to scale. If you're on AWS, EC2 can be configured to do this automatically.
Actually there can be multiple answers to this question because the answer depends on how you want to communicate amongst nodes, how you want to assign tasks to nodes and how you manage failures.
You may want to research on how other cluster managers work and then try to design something similar in your application.
A few approaches:
1) Use a load balancer in front and distribute load amongst the machines. This, I think, can be the simplest approach.
2) Use a messaging system like RabbitMQ/ActiveMQ (or any other AMQP system) for inter-node communication, and let there be a pool of master nodes that assigns tasks to specific nodes and communicates with them via the AMQP protocol.
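A minimal sketch of approach 2) using the amqplib package with RabbitMQ; the queue name and task shape are made up for illustration. A master process pushes tasks onto a durable queue, and each worker node pulls one task at a time and acknowledges it only after it succeeds.

```js
// Minimal sketch: master/worker task distribution over RabbitMQ with amqplib.
// Queue name, broker URL, and task shape are hypothetical.
const amqp = require('amqplib');

async function master(tasks) {
  const conn = await amqp.connect('amqp://rabbit-host');
  const ch = await conn.createChannel();
  await ch.assertQueue('node_tasks', { durable: true });
  for (const task of tasks) {
    ch.sendToQueue('node_tasks', Buffer.from(JSON.stringify(task)), { persistent: true });
  }
}

async function worker(handleTask) {
  const conn = await amqp.connect('amqp://rabbit-host');
  const ch = await conn.createChannel();
  await ch.assertQueue('node_tasks', { durable: true });
  ch.prefetch(1); // at most one unacknowledged task per worker
  ch.consume('node_tasks', async (msg) => {
    await handleTask(JSON.parse(msg.content.toString()));
    ch.ack(msg); // ack only after success, so failed tasks are redelivered
  });
}
```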

implement mutex in node.js

I would like to implement a mutex inside my Node.js application; here is the mutex article on Wikipedia: http://en.wikipedia.org/wiki/Mutual_exclusion.
Is there any ready-made module for this? If not, any ideas that can help me implement it?
There are many ways to accomplish this. Two easy ways are via Redis or ZooKeeper servers. Node.js has very good modules for both of them.
In Redis you can use the WATCH + MULTI commands to implement locking. In ZooKeeper you can create ephemeral nodes. Either way, no two processes will execute the critical operation at the same time.
I recently implemented the Redis approach in the node-ratelimiter module, which is a critical part of our production applications where we need to guarantee that no two processes increment the same value in Redis. Refer to WATCH and MULTI for details. The code is in fact very easy to understand and read.
For a ZooKeeper example, refer to the Locks recipe. It is possible to implement much more complex logic for distributed locks with ZooKeeper ephemeral nodes. The Redis solution is just a special case and works very well if you don't need more than that.
Using either of these approaches, you can implement mutexes for any app and any logic.
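For completeness, here is a minimal Node.js sketch of a Redis-backed mutex using SET with NX and PX (a common alternative to the WATCH/MULTI approach mentioned above), written against the redis v4 client. The key name and TTL are arbitrary, and the release step is deliberately simplified; a Lua script would make the check-and-delete atomic.

```js
// Minimal sketch: a Redis-backed mutex. The lock is acquired only if the key
// does not already exist (NX) and expires automatically (PX) so a crashed
// process cannot hold it forever.
const { createClient } = require('redis');
const { randomUUID } = require('crypto');

async function withLock(client, key, ttlMs, criticalSection) {
  const token = randomUUID();
  const acquired = await client.set(key, token, { NX: true, PX: ttlMs });
  if (acquired !== 'OK') return false; // someone else holds the mutex
  try {
    await criticalSection();
  } finally {
    // Release only if we still own the lock (check-then-delete; not atomic,
    // a Lua script would close that gap).
    if ((await client.get(key)) === token) await client.del(key);
  }
  return true;
}

// Usage:
// const client = createClient(); await client.connect();
// await withLock(client, 'locks:invoice:42', 5000, async () => { /* critical work */ });
```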

Cassandra as an embedded service and with custom consistency level

I am thinking of building an application that uses Cassandra as its data store but has low-latency requirements. I am aware of EmbeddedCassandraService from this blog post.
Is the following implementation possible and what are known pitfalls (defects, functional limitations)?
1) Run Cassandra as an embedded service, persisting data to disk (durable).
2) Java application interacts with the local embedded service via one of the following. What are the pros and cons of each?
TMemoryBuffer (or something more appropriate?)
StorageProxy (what are the pitfalls of using this API?)
Apache Avro? (see question #5 below)
3) Java application interacts with remote Cassandra service ("backup" nodes) via Thrift (or Avro?).
4) Write must always succeed to the local embedded Cassandra service in order to be successful, and at least one of the remote (non-embedded) Cassandra nodes. Is this possible? Is it possible to define a custom / complex consistency level?
5) Side question: Cassandra: The Definitive Guide mentions in several places that Thrift will ultimately be replaced with Avro, but it seems like that's not the case just yet?
As you might guess, I am new to Cassandra, so any direction to specific documentation pages (not the wiki homepage) or sample projects is appreciated.
Unless your entire database is sitting on the local machine (i.e. a single node), you gain nothing by this configuration. Cassandra will shard your data across the cluster, so (as mentioned in one of the comments) your writes will frequently be made to another node that owns the data. Presuming you write with a consistency level of at least one, your call will block until that other node acks the write. This negates any benefit of talking to the embedded instance since you have some network latency anyway.
