Guidewire batch job processing using Gatling - performance-testing

How can I run Guidewire batch job scenarios using Gatling? Mainly, I want to create a Scala script to run the batch job in Gatling.

In the webservice catalogue of your center, check the MaintenanceToolsAPI; it exposes a few operations:
getValidBatchProcessNames - returns the valid batch process names
startBatchProcess - starts a batch process without arguments
startBatchProcessWithArguments - starts a batch process with arguments
Guidewire also built a batchrunner accelerator that uses this API and exposes something a bit more consistent and usable; it is available in the marketplace.
In the end you should be fine with just calling the MaintenanceToolsAPI SOAP webservice from your Scala scripts. What you will probably have to figure out is polling for the state of the batch process, which is the main feature of the batchrunner accelerator.
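Given the above, a Gatling simulation only needs to POST a SOAP request to that API. The sketch below is a minimal example; the base URL, endpoint path, XML namespace, and the PolicyRenewalStart process name are all placeholders you would replace with the values from your own webservice catalogue:

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._

class BatchJobSimulation extends Simulation {

  // Assumed ClaimCenter-style host; adjust to your center.
  val httpProtocol = http.baseUrl("http://localhost:8080/cc")

  // Minimal SOAP envelope for startBatchProcess; the namespace
  // below is an assumption, not the real Guidewire one.
  def soapBody(processName: String): String =
    s"""<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
       |                  xmlns:mt="http://guidewire.com/ws/maintenancetools">
       |  <soapenv:Body>
       |    <mt:startBatchProcess>
       |      <mt:processName>$processName</mt:processName>
       |    </mt:startBatchProcess>
       |  </soapenv:Body>
       |</soapenv:Envelope>""".stripMargin

  val scn = scenario("Start batch process")
    .exec(
      http("startBatchProcess")
        .post("/ws/MaintenanceToolsAPI") // placeholder path
        .header("Content-Type", "text/xml")
        .body(StringBody(soapBody("PolicyRenewalStart")))
        .check(status.is(200))
    )

  setUp(scn.inject(atOnceUsers(1))).protocols(httpProtocol)
}
```

Polling for completion would repeat a similar status request inside a Gatling doWhile loop, using whichever status operation your catalogue exposes.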

Related

Update databricks job status through API

We need to execute a long-running EXE on a Windows machine and are thinking of ways to integrate it with the workflow. The plan is to include the EXE as a task in the Databricks workflow.
We are thinking of a couple of approaches:
Create a DB table and insert a row when this particular task starts in the workflow. The EXE running on the Windows machine polls the table for new records; once a new record is found, the EXE proceeds with the actual execution and updates the status after completion. Databricks queries this table constantly for the status, and once completed, the task finishes.
Using the Databricks API, check whether the task has started execution in the EXE and continue with execution. After the application finishes, update the task status to completion; until then the Databricks task runs like while (true). But the current API doesn't seem to support updating the task execution status (to Complete) (not 100% sure).
Please share thoughts or alternate solutions.
This is an interesting problem. Is there a reason you must use Databricks to execute an EXE?
Regardless, I think you have the right kind of idea. Here is how I would do this with the Jobs API:
Have your EXE process output a file to a staging location, probably in DBFS, since this will be locally accessible inside of Databricks.
Build a notebook to load this file; having a table is optional but may give you additional logging capabilities if needed. The output of your notebook should use the dbutils.notebook.exit method, which allows you to output any string or array value. You could return "In Progress" and "Success", or the latest line from the file you've written.
Wrap that notebook in a Databricks job, execute it on an interval with a cron schedule (you said 1 minute), and retrieve the output value of your job via the get-output endpoint.
Additional note: the benefit of abstracting this into return values from a notebook is that you could orchestrate it via other workflow tools, e.g. Databricks Workflows or Azure Data Factory inside an Until condition. There are no limits, so long as you can orchestrate a notebook in that tool.
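The notebook described above could be as small as the following Scala sketch; the DBFS path is a placeholder for wherever the EXE (or a small uploader) writes its status file:

```scala
// Databricks notebook cell (Scala) — a minimal sketch.
// Hypothetical path; point it at the file your EXE writes.
val statusPath = "/dbfs/tmp/exe_status.txt"

// Pick the status from the file's lines; report "In Progress"
// until the EXE has written anything.
def latestStatus(lines: List[String]): String =
  lines.lastOption.getOrElse("In Progress")

val status =
  if (new java.io.File(statusPath).exists)
    latestStatus(scala.io.Source.fromFile(statusPath).getLines().toList)
  else "In Progress"

// Return the value to the caller; a poller reads it back through
// the Jobs API get-output endpoint and stops at "Success".
dbutils.notebook.exit(status)
```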

Spark job as a web service?

A peer of mine has created code that opens a RESTful API web service within an interactive Spark job. The intent of our company is to use his code as a means of extracting data from various data sources. He can get it to work on his machine with a local instance of Spark. He insists that this is a good idea, and it is my job as DevOps to implement it with Azure Databricks.
As I understand it, interactive jobs are for one-time analytics inquiries and for the development of non-interactive jobs to be run solely as ETL/ELT work between data sources. There is, of course, the added problem of determining the endpoint for the service binding within the Spark cluster.
But I'm new to Spark and have scarcely delved into the mountain of documentation that exists for all the implementations of Spark. Is what he's trying to do a good idea? Is it even possible?
The web service would need to act as a Spark driver. Just as you'd run spark-shell, run some commands, and then use collect() to bring all the data back into the local environment, that all runs in a single JVM. The service would submit tasks to executors on a remote Spark cluster, then bring the data back over the network. Apache Livy is one existing implementation of a REST Spark submission server.
It can be done, but depending on the process it would be very asynchronous, and it is not suggested for the large datasets that Spark is meant for. Depending on the data you need (e.g. if you are mostly using SparkSQL), it may be better to query a database directly.
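To make the Livy route concrete, submitting a batch is a single HTTP POST to its /batches endpoint. The sketch below uses the JDK 11 HTTP client; the Livy host, jar path, and class name are placeholders:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

// Hypothetical Livy endpoint; adjust to your cluster.
val livyUrl = "http://livy-host:8998/batches"

// JSON payload for a Livy batch submission (file + entry class).
def batchPayload(jar: String, mainClass: String): String =
  s"""{"file": "$jar", "className": "$mainClass"}"""

// Build the POST request; nothing is sent until submit() is called.
def submit(): Int = {
  val request = HttpRequest.newBuilder()
    .uri(URI.create(livyUrl))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(
      batchPayload("hdfs:///jobs/my-spark-job.jar", "com.example.MyJob")))
    .build()
  HttpClient.newHttpClient()
    .send(request, HttpResponse.BodyHandlers.ofString())
    .statusCode()
}
```

Livy replies with a batch id; a caller would then poll GET /batches/{id}/state until it reports success or dead, which is exactly the asynchrony mentioned above.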

Apache Nifi - Submitting Spark batch jobs through Apache Livy

I want to schedule my Spark batch jobs from NiFi. I can see there is an ExecuteSparkInteractive processor which submits Spark jobs to Livy, but it executes the code provided in the property or in the content of the incoming flow file. How should I schedule my Spark batch jobs from NiFi, and also take different actions depending on whether the batch job fails or succeeds?
You could use ExecuteProcess to run a spark-submit command.
But what you seem to be looking for, is not a DataFlow management tool, but a workflow manager. Two great examples for workflow managers are: Apache Oozie & Apache Airflow.
If you still want to use NiFi to schedule Spark jobs, you can use the GenerateFlowFile processor, scheduled on the primary node so it won't be triggered twice (unless you want it to be), connect it to the ExecuteProcess processor, and have it run the spark-submit command.
For a slightly more complex workflow, I've written an article about it. :)
Hope it helps.
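As a rough sketch, the command configured on ExecuteProcess could amount to a spark-submit call like the following; the master, deploy mode, class name, and jar path are all placeholders:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyBatchJob \
  /path/to/my-batch-job.jar
```

ExecuteProcess captures the process output as a flow file, so you can route on the exit status downstream to take different actions on success or failure.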

Spark job execution time

This might be a very simple question. But is there any simple way to measure the execution time of a spark job (submitted using spark-submit)?
It would help us in profiling the spark jobs based on the size of input data.
EDIT : I use http://[driver]:4040 to monitor my jobs, but this Web UI shuts down the moment my job finishes.
Every SparkContext launches its own instance of the web UI, which is available at
http://[driver]:4040
by default (the port can be changed using spark.ui.port).
It offers pages (tabs) with the following information:
Jobs, Stages, Storage (with RDD size and memory use)
Environment, Executors, SQL
By default, this information is available only while the application is running.
Tip: you can still browse the web UI after the application has finished by enabling spark.eventLog.enabled and serving the logs through the Spark history server.
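For example, in spark-defaults.conf (the log directory is a placeholder; use any path all nodes and the history server can reach):

```properties
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-logs
```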
Sample web UI where you can see the elapsed time, e.g. 3.2 hours.
Spark itself provides quite granular information about each stage of your Spark job. Go to the Spark web interface at http://your-driver-node:4040; you can also use the history server.
If you just need the execution time, go to the standalone master UI at http://your-master-node:8080, where you can see the duration of each application submitted to Spark.
If you want, you can write a small piece of code to get the net execution time.
Example:
val t1 = System.nanoTime // first line of your code
// ... the work you want to time ...
val duration = (System.nanoTime - t1) / 1e9d // last line: elapsed seconds
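The one-off snippet above can be wrapped into a small reusable helper; this is just a sketch of the same System.nanoTime approach:

```scala
// Run a block of code and return its result together with the
// elapsed wall-clock time in seconds.
def timed[T](block: => T): (T, Double) = {
  val t1 = System.nanoTime
  val result = block
  val seconds = (System.nanoTime - t1) / 1e9d
  (result, seconds)
}

val (sum, elapsed) = timed { (1 to 1000000).sum }
println(f"result=$sum elapsed=$elapsed%.3fs")
```

Note this measures only the driver-side wall-clock time of whatever you wrap; per-stage breakdowns still come from the web UI.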

Bluemix Spark Service

Firstly, I need to admit that I am new to Bluemix and Spark. I just want to try out my hands with Bluemix Spark service.
I want to perform a batch operation over, say, a billion records in a text file, then I want to process these records with my own set of Java APIs.
This is where I want to use the Spark service to enable faster processing of the dataset.
Here are my questions:
Can I call Java code from Python? As I understand it, presently only a Python boilerplate is supported? There are also a few pieces of JNI beneath my Java API.
Can I perform the batch operation with the Bluemix Spark service, or is it just for interactive use?
Can I create something like a pipeline (output of one stage goes to another) with Bluemix, or do I need to code it myself?
I will appreciate any and all help coming my way with respect to the above queries.
Look forward to some expert advice here.
Thanks.
The IBM Analytics for Apache Spark service is now available, and it allows you to submit Java code / a batch program with spark-submit, along with a notebook interface for both Python and Scala.
Earlier, the beta was limited to the interactive notebook interface.
Regards
Anup
