Is there any difference between the Job data returned from Databricks Jobs API 2.1 vs 2.0? - databricks

The main difference I see is that 2.1 requires paging results with a max page size of 25, while 2.0 does not support paging and will get you all of the results in one call.
Is there any difference in the structure or content of the "Job" objects returned from the Jobs Get and List APIs, e.g. /api/2.0/jobs/list?
It is difficult for me to tell because I can't find the 2.0 specification, and the documentation just gives examples.
2.1 Open API specification
2.1 documentation (at the time of posting this is the 'latest', but you can't specifically reference version 2.1)
2.0 documentation

The responses returned from the GET methods of Jobs API 2.1 and 2.0 differ for each endpoint. The following is a sample response returned from each of the endpoints.
jobs/list:
When using jobs 2.0:
When using jobs 2.1:
The difference here is that the cluster details and notebook paths are missing in Jobs 2.1.
jobs/get:
Using jobs 2.0:
Using jobs 2.1:
Information about tasks is present in Jobs API 2.1 but not in 2.0.
jobs/runs/get:
There are not many notable differences using this endpoint.
jobs/runs/get-output:
The endpoint returned a response when using the Jobs 2.0 API, but when using Jobs 2.1 it returned the following error:
So, the main difference in the GET methods is noticeable for the jobs/list endpoint: the cluster details and notebook details are absent from the Jobs 2.1 response. Choose the API version according to your requirements.
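For illustration only (this is my sketch, not part of the original answer), here is a minimal Python example that fetches jobs/list from both versions so you can compare the returned job objects yourself. The DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are assumptions, and the limit/offset/has_more paging fields follow the 2.1 documentation linked above.

```python
# Hedged sketch: compare jobs/list between Jobs API 2.0 and 2.1.
# Assumes DATABRICKS_HOST (e.g. https://<workspace>.cloud.databricks.com)
# and DATABRICKS_TOKEN are set, and that the workspace has at least one job.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Jobs API 2.0: a single call returns every job, with settings such as
# new_cluster / existing_cluster_id and notebook_task.notebook_path.
jobs_20 = requests.get(f"{host}/api/2.0/jobs/list", headers=headers).json()["jobs"]

# Jobs API 2.1: results are paginated (limit capped at 25) and the slimmer
# job objects omit the cluster and notebook details that 2.0 returns.
jobs_21, offset = [], 0
while True:
    page = requests.get(f"{host}/api/2.1/jobs/list", headers=headers,
                        params={"limit": 25, "offset": offset}).json()
    jobs_21.extend(page.get("jobs", []))
    if not page.get("has_more"):
        break
    offset += 25

print("2.0 settings keys:", sorted(jobs_20[0]["settings"]))
print("2.1 settings keys:", sorted(jobs_21[0]["settings"]))
```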

There are a few significant changes between the 2.0 and 2.1 APIs:
The biggest change is support for jobs with multiple tasks (see the sketch after this list):
instead of a single task in the top-level object, you now specify tasks in the tasks array
you can specify dependencies between tasks
you can reuse the same job cluster between multiple tasks, which gives you faster task startup
there are more supported task types in 2.1: dbt, sql, ...
Similarly, operations like get run output support multiple tasks, but, for example, you can't get the output for the top-level job run; you need to get it for the individual tasks
the list operation now supports paginated output, which overcomes the previous limit of 3000 jobs per workspace; it also supports listing jobs by name
there is a new API call for repairing/re-running the failed tasks without re-running the whole job.
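As a hedged illustration of those multi-task features (not taken from the answer), a Jobs API 2.1 create call might look roughly like the sketch below; the cluster spec, notebook paths, and host/token handling are placeholders.

```python
# Hedged sketch: create a two-task job on Jobs API 2.1 with a dependency
# between tasks and a shared job cluster. All concrete values are placeholders.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

payload = {
    "name": "multi-task-example",
    "job_clusters": [{
        "job_cluster_key": "shared_cluster",
        "new_cluster": {                       # placeholder cluster spec
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }],
    "tasks": [
        {
            "task_key": "ingest",
            "job_cluster_key": "shared_cluster",
            "notebook_task": {"notebook_path": "/Jobs/ingest"},
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],  # runs after "ingest"
            "job_cluster_key": "shared_cluster",     # reuses the same job cluster
            "notebook_task": {"notebook_path": "/Jobs/transform"},
        },
    ],
}

response = requests.post(f"{host}/api/2.1/jobs/create", headers=headers, json=payload)
print("created job", response.json()["job_id"])
```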

Related

Spark for Constraint Engine

I have just started learning Spark and have set up a small cluster on top of YARN and HDFS. I have submitted small jobs for testing as well. However, I want to know if we can use Spark as a real-time constraint engine. Let me give an example. I want to build a web app where a user provides a JSON object via an API. The user also provides some constraints (if some_value < 100, remove the row, etc.). Can I perform these operations concurrently using Spark and provide a quick response to the client (the response to be returned as the response of the API)? One important requirement: I will have to provide the response in a very short time, less than a second.
I see that Spark is all about submitting jobs and processing them concurrently. Can the above requirements be fulfilled by Spark?

Spark Application as a Rest Service

I have a question regarding a specific spark application usage.
So I want our Spark application to run as a REST API server, like a Spring Boot application. It will therefore not be a batch process; instead, we will load the application and keep it live (no call to spark.close()), and use it as a real-time query engine via some API which we will define. I am targeting deployment to Databricks. Any suggestions will be good.
I have checked Apache Livy, but I am not sure whether it will be a good option or not.
Any suggestions will be helpful.
Spark isn't designed to run like this; it has no built-in REST API server frameworks other than the History Server and Worker UI.
If you wanted a long-running Spark action, then you could use Spark Streaming and issue actions to it via raw sockets, Kafka, etc., rather than HTTP methods.
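As a rough sketch of that idea (my own illustration, not the answerer's code), a long-running Structured Streaming query could consume "requests" from Kafka and write results back out instead of serving HTTP; the broker address, topic names, and checkpoint path are placeholders, and it assumes the spark-sql-kafka connector is on the classpath.

```python
# Hedged sketch: a long-running Spark job that reads requests from a Kafka
# topic and writes results to another topic. Broker/topics/paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-query-engine").getOrCreate()

incoming = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "incoming-requests")
            .load())

# Placeholder "query engine" logic: here we just echo the payload back.
results = incoming.selectExpr("CAST(value AS STRING) AS value")

query = (results.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("topic", "responses")
         .option("checkpointLocation", "/tmp/checkpoints/query-engine")
         .start())

query.awaitTermination()  # keep the application alive indefinitely
```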
Good question; let's discuss it step by step.
You can create it, and it works fine; the following is an example:
https://github.com/vaquarkhan/springboot-microservice-apache-spark
I am sure you are thinking of creating a Dataset or DataFrame, keeping it in memory, and using it as a cache (Redis, Gemfire, etc.), but here is the catch:
i) If you have data in the few-hundred-thousand range, then you don't really need Apache Spark's power; a Java app is good enough to return responses really fast.
ii) If you have petabytes of data, then loading it into memory as a Dataset or DataFrame will not help, because Apache Spark doesn't support indexing; Spark is not a data management system but a fast batch data processing engine. With Gemfire you have the flexibility to add indexes for fast retrieval of data.
Workarounds:
Using Apache Ignite's (https://ignite.apache.org/) in-memory indexes (refer to "Fast Apache Spark SQL Queries")
Using data formats that support indexing, like ORC, Parquet, etc.
So why not use a Spring application with Apache Spark without calling spark.close()?
A Spring application as a microservice needs other services, either on containers or on PCF/Bluemix/AWS/Azure/GCP, etc., whereas Apache Spark is its own world and needs compute power that is not available on PCF.
Spark is not a database, so it cannot "store data". It processes data and stores it temporarily in memory, but that's not persistent storage.
Once a Spark job is submitted, you have to wait for the results; you cannot fetch data in between.
How to use Spark with a Spring application as a REST API call:
Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library.
https://livy.apache.org/
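For illustration, here is a hedged Python sketch of how a web application could talk to Spark through Livy's REST interface; the host/port and the submitted snippet are placeholders, and error handling is omitted.

```python
# Hedged sketch: submit Spark code to a cluster via Livy's REST API.
import time
import requests

LIVY = "http://livy-server:8998"  # placeholder host and port

# 1. Start an interactive PySpark session.
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()
session_url = f"{LIVY}/sessions/{session['id']}"

# 2. Wait until the session is ready.
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(2)

# 3. Submit a snippet of Spark code and poll for its result.
stmt = requests.post(f"{session_url}/statements",
                     json={"code": "spark.range(100).count()"}).json()
stmt_url = f"{session_url}/statements/{stmt['id']}"
while True:
    result = requests.get(stmt_url).json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(1)

# 4. Clean up the session when the application shuts down.
requests.delete(session_url)
```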

How to display step-by-step execution of sequence of statements in Spark application?

I have an Apache Spark data loading and transformation application with pyspark.sql that runs for half an hour before throwing an AttributeError or other run-time exceptions.
I want to test my application end-to-end with a small data sample, something like Apache Pig's ILLUSTRATE. Sampling down the data does not help much. Is there a simple way to do this?
It sounds like an idea that could easily be handled by a SparkListener. It gives you access to all the low-level details that the web UI of any Spark application could ever show you. All the events flying between the driver (namely DAGScheduler and TaskScheduler with SchedulerBackend) and the executors are posted to registered SparkListeners, too.
A Spark listener is an implementation of the SparkListener developer API (which is an extension of SparkListenerInterface where all the callback methods are no-op/do-nothing).
Spark uses Spark listeners for the web UI, event persistence (for the Spark History Server), dynamic allocation of executors and other services.
You can develop your own custom Spark listeners and register them using the SparkContext.addSparkListener method or the spark.extraListeners setting.
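As a small illustration (my assumption, not part of the answer), the spark.extraListeners route can be exercised from PySpark with Spark's built-in StatsReportListener; a custom listener would instead be your own JVM class placed on the driver classpath.

```python
# Hedged sketch: register a listener via spark.extraListeners. The built-in
# StatsReportListener logs per-stage summary statistics to the driver logs.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("listener-demo")
         .config("spark.extraListeners",
                 "org.apache.spark.scheduler.StatsReportListener")
         .getOrCreate())

# Any job run now emits scheduler events to the registered listener.
spark.range(1_000_000).selectExpr("sum(id)").show()
```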
Go to the Spark UI of your job and you will find a DAG Visualization there. That's a graph representing your job.
To test your job on a sample, use the sample as the input first of all ;) Also, you may run Spark locally, not on a cluster, and then debug it in the IDE of your choice (like IDEA).
More info:
This great answer explaining DAG
DAG introduction from Databricks
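As a rough sketch of that suggestion (my illustration, not the answerer's code), you could run the same pipeline locally against a small sample; the input path, the timestamp column, and the transformation are placeholders.

```python
# Hedged sketch: sample the real input and run the pipeline locally so that
# runtime errors surface in minutes instead of half an hour.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .master("local[*]")                 # run on the local machine, not the cluster
         .appName("end-to-end-smoke-test")
         .getOrCreate())

df = spark.read.parquet("/data/events")      # placeholder input path
sample = df.sample(fraction=0.001, seed=42)  # small, deterministic sample

# Placeholder for the real pyspark.sql pipeline under test.
transformed = sample.withColumn("ingest_date", F.to_date(F.col("timestamp")))

# Force full execution of the plan so AttributeErrors and other runtime
# exceptions show up immediately.
transformed.show(20, truncate=False)
print("rows in sample:", transformed.count())
```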

How to create a custom apache Spark scheduler?

I have a p2p mesh network of nodes. It has its own balancing and, given a task T, can reliably execute it (if one node fails, another will continue). My mesh network has Java and Python APIs. I wonder what steps are needed to make Spark call my API to launch tasks?
Oh boy, that's a really broad question, but I agree with Daniel. If you really want to do this, you could first start with:
Scheduler Backends, which states things like:
Being a scheduler backend in Spark assumes an Apache Mesos-like model in which "an application" gets resource offers as machines become available and can launch tasks on them. Once a scheduler backend obtains the resource allocation, it can start executors.
TaskScheduler, since you need to understand how tasks are meant to be scheduled to build a scheduler, which mentions things like this:
A TaskScheduler gets sets of tasks (as TaskSets) submitted to it from the DAGScheduler for each stage, and is responsible for sending the tasks to the cluster, running them, retrying if there are failures, and mitigating stragglers.
An important concept here is the Directed Acyclic Graph (DAG), where you can take a look at its GitHub page.
You can also read What is the difference between FAILED AND ERROR in spark application states to get an intuition.
Spark Listeners — Intercepting Events from Spark can also come in handy:
Spark Listeners intercept events from the Spark scheduler that are emitted over the course of execution of Spark applications.
You could first take the Exercise: Developing Custom SparkListener to monitor DAGScheduler in Scala to check your understanding.
In general, Mastering Apache Spark 2.0 seems to have plenty of resources, but I will not list more here.
Then, you have to meet the Final Boss in this game, Spark's Scheduler GitHub page, and get a feel for it. Hopefully, all this will be enough to get you started! :)
Take a look at how existing schedulers (YARN and Mesos) are implemented.
Implement the scheduler for your system.
Contribute your changes to the Apache Spark project.

Bluemix Spark Service

Firstly, I need to admit that I am new to Bluemix and Spark. I just want to try out my hands with Bluemix Spark service.
I want to perform a batch operation over, say, a billion records in a text file, then I want to process these records with my own set of Java APIs.
This is where I want to use the Spark service to enable faster processing of the dataset.
Here are my questions:
Can I call Java code from Python? As I understand it, presently only Python boilerplate is supported. There are a few pieces of JNI as well beneath my Java API.
Can I perform the batch operation with the Bluemix Spark service, or is it just for interactive purposes?
Can I create something like a pipeline (the output of one stage goes to another) with Bluemix, or do I need to code it myself?
I will appreciate any and all help coming my way with respect to the above queries.
Look forward to some expert advice here.
Thanks.
The IBM Analytics for Apache Spark service is now available, and it allows you to submit Java code/batch programs with spark-submit, along with a notebook interface for both Python and Scala.
Earlier, the beta was limited to the interactive notebook interface.
