Hazelcast Jet pipeline created on an app with multiple instances causing problems - hazelcast-jet

I have an app where I have created a Jet instance and a pipeline job to aggregate the result of a streaming data source. I am running multiple instances of this app.
The problem I am facing is that since there are 2 instances, 2 pipeline jobs are running, and hence the result is computed twice and is incorrect, even though Jet figures out that both instances are part of the same cluster.
Doesn't the Jet pipeline check the pipeline job and, if it is the same, consider it as one, just like Kafka Streams does with its topology?

Job submission in Jet 0.7 goes to the entire cluster. If you submit the same Pipeline/DAG twice, the job will execute twice.
The upcoming version adds a newJobIfAbsent() method: if the job has a name, it will only be submitted if there is no active job with the same name. If a job with the same name already exists, the method returns a Job handle to that existing job.
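A minimal sketch of how that named submission could look. The method name is taken from the answer above; the exact signature in the released version may differ, and the trivial source/sink here are placeholders for the real aggregation:

import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.Job;
import com.hazelcast.jet.config.JobConfig;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.test.TestSources;

public class SubmitNamedJob {
    public static void main(String[] args) {
        JetInstance jet = Jet.newJetInstance();

        Pipeline pipeline = Pipeline.create();
        // Placeholder pipeline; replace with the real streaming aggregation.
        pipeline.readFrom(TestSources.itemStream(1))
                .withoutTimestamps()
                .writeTo(Sinks.logger());

        // The job name is what makes the submission idempotent across app instances.
        JobConfig config = new JobConfig().setName("my-aggregation");

        // Every app instance can call this on startup; only the first call actually
        // starts the job, later calls just return a handle to the running job.
        Job job = jet.newJobIfAbsent(pipeline, config);
        System.out.println("Job id: " + job.getId());
    }
}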

Related

How to make job wait for cluster to become available

I have a workflow in Databricks called "score-customer", which I can run with a parameter called "--start_date". I want to make a job for each date this month, so I manually create 30 runs - passing a different date parameter for each run. However, after 5 concurrent runs, the rest of the runs fail with:
Unexpected failure while waiting for the cluster (1128-195616-z656sbvv) to be ready.
I want my jobs to wait for the cluster to become available instead of failing. How is this achieved?

Detecting the end of an Azure Batch job

In my application I create an Azure batch job. It's a Node app and I use an azure-batch Node client, but I could also be using REST, I don't think it matters. I can't switch to a C# client, however.
I expect the job to be completed in a few seconds and I wish to pause the code until the batch job is over but I am not sure how to detect the end of the job without polling the Job Status API. Neither the Node client nor the REST API exposes such functionality. I thought I could maybe register for an event of some sort but was not able to find anything like that. There are job release tasks but I am not sure if I can achieve this using them.
Any ideas how the end of an Azure batch job can be detected from within my application?
One way to do this: once you have added your tasks to the job, set the job's onAllTasksComplete property to 'terminatejob'.
Then you can poll the Job-Get API, and check the state property on the job for when the job is complete (https://learn.microsoft.com/en-us/rest/api/batchservice/job/get#jobstate or https://learn.microsoft.com/en-us/javascript/api/azure-batch/job?view=azure-node-latest#get-string--object-).
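For illustration, a minimal polling sketch against the documented Job - Get REST endpoint. The question uses Node, but the idea is the same in any language; the api-version value and the bearer-token auth below are assumptions, and real code should parse the JSON response properly and add backoff/timeouts:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class WaitForBatchJob {
    public static void main(String[] args) throws Exception {
        String batchUrl = System.getenv("BATCH_URL"); // e.g. https://<account>.<region>.batch.azure.com
        String jobId = System.getenv("JOB_ID");
        String token = System.getenv("AAD_TOKEN");    // assumed: an AAD bearer token for the Batch service

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(batchUrl + "/jobs/" + jobId + "?api-version=2022-01-01.15.0")) // illustrative api-version
                .header("Authorization", "Bearer " + token)
                .GET()
                .build();

        while (true) {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            // Naive check; with onAllTasksComplete='terminatejob' the job moves to the
            // 'completed' state once all tasks have finished. Use a JSON parser in real code.
            if (response.body().contains("\"state\":\"completed\"")) {
                System.out.println("Job " + jobId + " has completed");
                break;
            }
            Thread.sleep(5_000); // poll every 5 seconds
        }
    }
}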

Cluster design for downloading/streaming a dataset to a user

In our system, we classically have two components: A Cloudera Hadoop cluster (CDH) and an OpenShift "backend" system. In HDFS, we have some huge .parquet files.
We now have a business requirement to "export the data by a user-given filter criterion" to a user in "realtime" as a downloadable file. So the flow is: the user enters an SQL-like filter string, for instance user='Theo' and command='execution'. He then sends a GET /export request to our backend service with the filter string as a parameter. The user shall now get a "download file" in his web browser and immediately start downloading that file as CSV (even if it's multiple terabytes or even petabytes in size; that's the user's choice if he wants to try it out and wait that long). In fact, the cluster should respond synchronously but not cache the entire response on a single node before sending the result; it should only receive data at the "internet speed" of the user and stream it directly to the user (with a buffer of e.g. 10 or 100 MB).
I now face the problem on how to best approach this requirement. My considerations:
I wanted to use Spark for that. Spark would read the Parquet file, apply the filter easily and then "coalesce" the filtered result to the driver, which in turn streams the data back to the requesting backend/client. During this task, the driver should of course not run out of memory if the data is sent back too slowly to the backend/user; the executors should just deliver the data at the same speed as it is "consumed".
However, I face some problems here:
The standard use case is that the user has fine-grained filters, so that his exported file contains something like 1000 lines only. If I submitted a new Spark job via spark-submit for each request, I would already run into latencies of multiple seconds due to initialization and query plan creation (even if it's just as simple as reading and filtering the data). I'd like to avoid that.
The cluster and the backend are strictly isolated. The operations guys ideally don't want us to reach the cluster from the backend at all; rather, the cluster should just call the backend. We are able to "open" maybe one port, but we will possibly not be able to argue for something like "our backend will run the Spark driver but be connected to the cluster as execution backend".
Is it a "bad design smell" if we run a "server spark job", i.e. we submit an application with mode "client" to the cluster master which also opens a port for HTTP requests and only runs a spark pipeline on requests, but holds the spark context open all the time (and is reachable from our backend via a fixed URL)? I know that there is "spark-job-server" project which does this, but it still feels a bit weird due to the nature of Spark and Jobs, where "naturally" a job would be to download a file and not be a 24h running server waiting to execute some pipeline steps from time to time.
I have no idea how to limit Spark's result fetching so that the executors send data at a speed at which the driver won't run out of memory if the user requested petabytes. Any suggestions on this?
Is Spark a good choice for this task after all, or do you have any suggestions for better tooling here? (At best in a CDH 5.14 environment, as we can't get the operations team to install any additional tools.)
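For what it's worth, a minimal sketch of the "stream through the driver" idea described above, assuming a long-lived SparkSession ("server job") is acceptable: Dataset.toLocalIterator() pulls one partition at a time to the driver, so the driver only ever buffers roughly one partition while rows are written out at the consumer's pace. The HDFS path, the filter string and the naive CSV formatting are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.io.OutputStream;
import java.io.PrintWriter;
import java.util.Iterator;

public class ExportService {
    private final SparkSession spark;

    public ExportService(SparkSession spark) {
        this.spark = spark;
    }

    // Writes the filtered rows to `out` (e.g. the HTTP response stream) as naive CSV.
    public void export(String filter, OutputStream out) {
        Dataset<Row> rows = spark.read()
                .parquet("hdfs:///data/events.parquet")  // hypothetical path
                .filter(filter);                         // e.g. "user = 'Theo' AND command = 'execution'"

        try (PrintWriter writer = new PrintWriter(out)) {
            // toLocalIterator() fetches one partition at a time to the driver,
            // so memory use stays bounded even for very large results.
            Iterator<Row> it = rows.toLocalIterator();
            while (it.hasNext()) {
                writer.println(it.next().mkString(","));  // naive CSV; no quoting/escaping
            }
        }
    }
}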

How to get job at new node of hazelcast jet cluster

Can anybody say how a new Jet cluster instance should start a job?
Use case 1:
start a Jet cluster with 3 nodes
submit a job to the cluster
all 3 nodes start the job and process data
Use case 2:
start a 4th node
the 4th node does nothing because there is no new submit-job command
How should a new cluster instance start jobs that are already running on the other nodes?
The feature you ask for is planned for Jet 0.5, which is scheduled for the end of September 2017.
In Jet 0.4 you have to cancel the current job and start it anew; however, you'll lose the processor state. Also note that the job is not cancelled by cancelling the client which submitted it; you have to use:
Future<Void> future = jetInstance.newJob(dag).execute();  // dag = the DAG you originally submitted
// some time later
future.cancel(true);  // Future.cancel takes a mayInterruptIfRunning flag

Triggering spark job from UI

Requirement:
Trigger a Spark job from the UI by a user action (say, a submit button click).
Once the Spark job is finished, a summary of the status has to be displayed in the UI.
Design approach:
1. Once the user initiates a job run by clicking the submit button in the UI, we will insert a row into an Impala queue table using Impala JDBC (see the sketch after this list).
The simplified structure of the queue table is as follows:
JOB_RUN_QUEUE (REQUEST_ID, STATUS, INPUT_PARAM_1, INPUT_PARAM_2, SUMMARY)
The initial request will have STATUS='SUBMIT'.
2. Oozie will be configured to orchestrate the request handling and Spark job execution.
Once Oozie finds an entry in the queue table JOB_RUN_QUEUE with STATUS='SUBMIT', it will pull the arguments from the queue table and trigger the Spark job.
It will update the status in the queue table to 'IN PROGRESS'. Upon successful completion it will update the summary and status in the queue table.
On failure it will update the status to 'FAILURE'.
3. The UI will read the data from the queue table and display it.
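As an illustration of step 1, a minimal JDBC sketch for inserting the request row. The connection string, port and driver choice are assumptions (Impala is commonly reached through its own JDBC driver or the Hive JDBC driver on port 21050):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class SubmitJobRequest {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string; adjust host/port/driver for your cluster.
        String url = "jdbc:impala://impala-host:21050/default";

        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO JOB_RUN_QUEUE (REQUEST_ID, STATUS, INPUT_PARAM_1, INPUT_PARAM_2, SUMMARY) "
                   + "VALUES (?, 'SUBMIT', ?, ?, NULL)")) {
            stmt.setString(1, UUID.randomUUID().toString());
            stmt.setString(2, args.length > 0 ? args[0] : "param1");
            stmt.setString(3, args.length > 1 ? args[1] : "param2");
            stmt.executeUpdate();  // Oozie later picks up rows with STATUS='SUBMIT'
        }
    }
}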
Questions:
1. Is there an alternative and better design approach?
2. Do I need to have a queue mechanism for the initial request, or can I leverage some inbuilt functionality?
