Can anybody explain how a new Jet cluster instance should start a job?
Use case 1:
start a Jet cluster with 3 nodes
submit a job to the cluster
all 3 nodes start the job and process data
Use case 2:
start a 4th node
the 4th node does nothing, because no new submit-job command was issued
How should a new cluster instance start jobs that are already running on the other nodes?
The feature you are asking for is planned for Jet 0.5, scheduled for the end of September 2017.
In Jet 0.4 you have to cancel the current job and start it anew; however, you'll lose the processor state. Also note that the job is not cancelled by cancelling the client which submitted it; you have to use:
// dag is the DAG that was built for the job
Future<Void> future = jetInstance.newJob(dag).execute();
// some time later
future.cancel(true);
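For the 3-node/4-node scenario above, a minimal sketch of the resulting workaround, assuming dag is the same DAG that was originally submitted:
// After the 4th node has joined the cluster, cancel the running job
// and submit the same DAG again so that the new member also gets work.
future.cancel(true);
Future<Void> restarted = jetInstance.newJob(dag).execute();
The restarted job runs from scratch on all 4 members, since Jet 0.4 does not preserve processor state across a restart.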
Related
I have a workflow in Databricks called "score-customer", which I can run with a parameter called "--start_date". I want to run a job for each date this month, so I manually create 30 runs, passing a different date parameter to each run. However, after 5 concurrent runs, the rest of the runs fail with:
Unexpected failure while waiting for the cluster (1128-195616-z656sbvv) to be ready.
I want my jobs to wait for the cluster to become available instead of failing. How can this be achieved?
I have an app in which I create a Jet instance and a pipeline job to aggregate the result of streaming data. I am running multiple instances of this app.
The problem I am facing is that, since there are 2 instances, 2 pipeline jobs are running and the result is therefore computed twice and is incorrect, even though the two Jet instances do form a single cluster.
Doesn't the Jet pipeline check the pipeline job and, if it is the same, treat it as one, just like Kafka Streams does with its topology?
Job submission in Jet 0.7 applies to the entire cluster: if you submit the same Pipeline/DAG twice, the job will execute twice.
The upcoming version adds a newJobIfAbsent() method: if the job has a name, it is submitted only if there is no active job with an equal name. If a job with an equal name already exists, the method returns a Job handle to the already existing job.
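A minimal sketch of how that looks, assuming a Jet release that already ships newJobIfAbsent() (Jet 3.0 or later); the map names and the job name are only illustrative:
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.Job;
import com.hazelcast.jet.config.JobConfig;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class SingleJobPerCluster {
    public static void main(String[] args) {
        JetInstance jet = Jet.newJetInstance();

        // Illustrative pipeline: copy entries from one IMap to another.
        Pipeline pipeline = Pipeline.create();
        pipeline.drawFrom(Sources.<String, String>map("input"))
                .drainTo(Sinks.map("output"));

        // Every app instance can run this block; the job named
        // "aggregate-job" is created only once, and later calls simply
        // return a handle to the already running job.
        JobConfig config = new JobConfig().setName("aggregate-job");
        Job job = jet.newJobIfAbsent(pipeline, config);
        job.join();
    }
}
With that in place, running two instances of the app no longer computes the aggregation twice: both instances attach to the same named job.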
We have configured a 3-node Cassandra cluster on RHEL 7.2 and we are doing cluster testing. When we start Cassandra on all 3 nodes they form a cluster and work fine.
But when we bring one node down using the "init 6" or "reboot" command, the rebooted node takes more time to join the cluster; however, if we manually kill and restart the Cassandra process, the node joins the cluster immediately without any issues.
We have provided all 3 IPs as seed nodes, the cluster name is the same on all 3 nodes, and each node uses its own IP as the listen address.
Please help us resolve this issue.
Thanks
Update
Cassandra version: 3.9
While investigating the issue further, we noticed that Node 1 (the rebooted node) is able to send "SYN" and "ACK2" messages to both of the other nodes (Node 2, Node 3), even though nodetool status displays Node 2 and Node 3 as "DN" only on Node 1.
After 10-15 minutes we noticed a "Connection Timeout" exception in Node 2 and Node 3, thrown from OutboundTcpConnection.java (line # 311), which triggers a state change event for Node 1 and changes its state to "UN":
// OutboundTcpConnection.java, around line 311
if (logger.isTraceEnabled())
    logger.trace("error writing to {}", poolReference.endPoint(), e);
Please let us know what triggers the "Connection Timeout" exception in Node 2 and Node 3 and how we can resolve this.
We believe this issue is similar to https://issues.apache.org/jira/browse/CASSANDRA-9630
But when we bring one node down using the "init 6" or "reboot" command, the rebooted node takes more time to join the cluster; however, if we manually kill and restart the Cassandra process, the node joins the cluster immediately without any issues.
Remember that Cassandra writes everything to the commit log to ensure durability in case of a "plug-out-of-the-wall" event. When that happens, Cassandra reconciles the data stored on disk with the data in the commit log at start-up time. If there are differences, that reconciliation can take a while to complete.
That's why it's important to run these commands before stopping Cassandra:
nodetool disablegossip
nodetool drain
Disabling gossip makes sure that the node you're shutting down won't take any additional requests. Draining ensures that anything in the commit log is written to disk. Once those are done, you can stop your node; the subsequent restart should then occur much faster.
We are using a Hazelcast executor service to distribute tasks across our cluster of servers.
We want to shut down one of our servers and take it out of the cluster, but allow it to keep working for a period to finish what it is doing, while not accepting any new tasks from the Hazelcast executor service.
I don't want to shut down the Hazelcast instance, because the current tasks may need it to complete their work.
Shutting down the Hazelcast executor service is not what I want either; that shuts down the executor cluster-wide.
I would like to continue processing the tasks in the local queue until it is empty and then shut down.
Is there a way for me to let a node in the cluster continue to use Hazelcast but tell it to stop accepting new tasks from the executor service?
Not that easily. However, you have member attributes (Member::setX/::getX), and you could set an attribute to signal "no new tasks, please". When you submit a task, you then either preselect a member to execute on based on that attribute, or you use the submit overload that takes a MemberSelector.
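A minimal sketch of that approach, assuming Hazelcast 3.x (where Member still exposes setStringAttribute/getStringAttribute); the attribute key "draining", the executor name and the MyTask class are only illustrative:
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.Member;
import com.hazelcast.core.MemberSelector;

import java.io.Serializable;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

public class DrainingMemberExample {

    // Illustrative task; any Callable that is also Serializable works.
    static class MyTask implements Callable<String>, Serializable {
        @Override
        public String call() {
            return "done";
        }
    }

    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // On the node that is about to be taken out of rotation:
        // flag the local member so that submitters can avoid it.
        hz.getCluster().getLocalMember().setStringAttribute("draining", "true");

        // On the submitting side: only pick members that are not draining.
        MemberSelector notDraining =
                (Member member) -> !"true".equals(member.getStringAttribute("draining"));

        IExecutorService executor = hz.getExecutorService("tasks");
        Future<String> result = executor.submit(new MyTask(), notDraining);
        System.out.println(result.get());
    }
}
The flagged member keeps its Hazelcast instance alive, so tasks already queued on it can still complete; it just stops being selected for new submissions.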
What's the behavior when a partition is sent to a node and the node crashes right before executing a job? If a new node is introduced into the cluster, which entity detects the addition of this new machine? Does the new machine get assigned the partition that didn't get processed?
The master considers a worker to have failed if it hasn't received a heartbeat message from it for the past 60 seconds (configured by spark.worker.timeout). In that case the partition is assigned to another worker (remember that a partitioned RDD can be reconstructed even if it is lost).
As for whether a new node can be introduced into the cluster: the spark-master will not detect a new node added once the slaves are started. Before the application is submitted to the cluster, sbin/start-master.sh starts the master and sbin/start-slaves.sh reads the conf/slaves file (which contains the IP addresses of all slaves) on the spark-master machine and starts a slave instance on each machine listed. The spark-master will not re-read this configuration file after being started, so it is not possible to add a new node once all slaves have started.