I only know the basics of the Apache Spark framework. Is it possible to extend the Spark scheduler so that it places a task's operators on specific node machines, as is possible in Apache Storm?
In Apache Storm you only need to implement the IScheduler interface and its methods.
Related
Dask is a pure-Python distributed computing platform, similar to Apache Spark.
Is there a way to run and monitor Dask distributed jobs/tasks through a REST API, similar to Apache Livy for Apache Spark?
Not quite what you're asking for, but take a look at Prefect, which has strong integration with Dask (for task execution).
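If you just need something Livy-like, a minimal sketch of a hand-rolled REST wrapper around `dask.distributed` is shown below. The scheduler address, the Flask app, and the `square` task are illustrative assumptions, not part of Dask or Livy, and the in-memory `futures` dict is not durable:

```python
# Hypothetical REST wrapper around dask.distributed -- a sketch, not an official API.
# Assumes a Dask scheduler reachable at SCHEDULER_ADDRESS and Flask installed.
from flask import Flask, jsonify, request
from dask.distributed import Client

SCHEDULER_ADDRESS = "tcp://scheduler:8786"  # assumption: adjust to your cluster

app = Flask(__name__)
client = Client(SCHEDULER_ADDRESS)
futures = {}  # job-id -> Future, kept in memory for illustration only


def square(x):
    """Example task; in practice you would dispatch to registered functions."""
    return x * x


@app.route("/jobs", methods=["POST"])
def submit_job():
    x = request.json.get("x", 0)
    future = client.submit(square, x)
    futures[future.key] = future
    return jsonify({"job_id": future.key}), 202


@app.route("/jobs/<job_id>", methods=["GET"])
def job_status(job_id):
    future = futures.get(job_id)
    if future is None:
        return jsonify({"error": "unknown job"}), 404
    body = {"status": future.status}  # 'pending', 'finished', 'error', ...
    if future.status == "finished":
        body["result"] = future.result()
    return jsonify(body)


if __name__ == "__main__":
    app.run(port=8998)  # port chosen arbitrarily (8998 happens to be Livy's default)
```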
I want to use a KLL sketch for a streaming application. The catch is that a KLL sketch is a stateful computation and thus not idempotent. Can this be implemented in Beam or Flink, preferably in Python?
The Apache DataSketches site says that integration efforts have started with Apache Flink and Apache Impala, and that there is also interest from Apache Beam. What is the main difficulty? Most sketches use primitive types and arrays internally.
You can use state and timers to implement this in a streaming Beam pipeline.
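A minimal sketch of that idea is below. It assumes the `datasketches` Python package exposes `kll_floats_sketch` with `update`/`serialize`/`deserialize`/`get_quantile` (check the exact binding names against the version you install), and the state/timer parameter plumbing may vary slightly between Beam versions. The serialized sketch lives in per-key state and a processing-time timer periodically emits a quantile snapshot:

```python
# Sketch of a stateful DoFn keeping a per-key KLL sketch in Beam state.
# Assumption: the `datasketches` package provides kll_floats_sketch with
# update()/serialize()/deserialize()/get_quantile(); names may differ.
import apache_beam as beam
from apache_beam.coders import BytesCoder
from apache_beam.transforms.userstate import (
    ReadModifyWriteStateSpec, TimerSpec, on_timer)
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.utils.timestamp import Duration, Timestamp


class KllQuantilesFn(beam.DoFn):
    SKETCH = ReadModifyWriteStateSpec('sketch', BytesCoder())
    FLUSH = TimerSpec('flush', TimeDomain.REAL_TIME)

    def process(self, element,
                sketch_state=beam.DoFn.StateParam(SKETCH),
                flush_timer=beam.DoFn.TimerParam(FLUSH)):
        from datasketches import kll_floats_sketch  # assumed binding
        key, value = element
        blob = sketch_state.read()
        sketch = (kll_floats_sketch.deserialize(blob)
                  if blob else kll_floats_sketch(200))
        sketch.update(float(value))
        sketch_state.write(sketch.serialize())
        # Re-arm a processing-time timer so a snapshot is emitted roughly every minute.
        flush_timer.set(Timestamp.now() + Duration(seconds=60))

    @on_timer(FLUSH)
    def flush(self, key=beam.DoFn.KeyParam,
              sketch_state=beam.DoFn.StateParam(SKETCH)):
        from datasketches import kll_floats_sketch  # assumed binding
        blob = sketch_state.read()
        if blob:
            sketch = kll_floats_sketch.deserialize(blob)
            yield key, sketch.get_quantile(0.5)  # emit the running median
```

Stateful DoFns require a keyed PCollection, so you would apply it as something like `... | beam.Map(lambda x: ('all', x)) | beam.ParDo(KllQuantilesFn())`.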
I was reading about Apache Beam and went through the various runners it supports. But I am wondering: why should someone use Apache Beam with the Spark runner when they could use Apache Spark directly?
Because Apache Beam is unified, portable, and extensible: you implement batch and streaming data processing jobs once, and they can run on any supported execution engine. This means you can write a single codebase for both streaming and batch jobs without any dependency on the execution platform.
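As a small illustration (a sketch, not production code), the pipeline below is identical whether you run it locally or submit it to Spark; only the pipeline options change:

```python
# The same Beam pipeline can target different runners purely via options.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# For Spark you would instead pass something like:
#   --runner=SparkRunner (plus the job-server/master settings your setup needs)
options = PipelineOptions(['--runner=DirectRunner'])

with beam.Pipeline(options=options) as p:
    (p
     | 'Create' >> beam.Create(['to be or not to be'])
     | 'Split' >> beam.FlatMap(str.split)
     | 'Pair' >> beam.Map(lambda word: (word, 1))
     | 'Count' >> beam.CombinePerKey(sum)
     | 'Print' >> beam.Map(print))
```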
I'm new to workflow engines and I need to fork a couple of my jobs, so I thought of using Apache Oozie for this purpose. I use Spark Standalone as my cluster manager.
But most of the documents I have gone through only talk about Oozie on YARN. My question is: is an Oozie workflow supported on Spark Standalone, and is it recommended?
If so, can you share an example? Alternatively, I would also like to know the possibilities of forking in Spark without using any of the workflow engines. What is the industry-standard way of scheduling jobs apart from cron?
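For the last part of the question, forking work inside a single Spark application without any workflow engine, one common approach is to submit independent jobs from separate driver threads. A minimal PySpark sketch (the job bodies are illustrative placeholders, and the FAIR scheduler setting is optional):

```python
# Sketch: run two independent Spark jobs in parallel from one driver,
# without an external workflow engine. Works on Spark Standalone as well.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("forked-jobs")
         # Optional: the FAIR scheduler lets concurrent jobs share executors.
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())


def job_a():
    # Illustrative workload; replace with your real transformation.
    return spark.range(1_000_000).selectExpr("sum(id)").collect()


def job_b():
    return spark.range(1_000_000).selectExpr("count(*)").collect()


# Spark allows concurrent job submission from multiple threads of one driver.
with ThreadPoolExecutor(max_workers=2) as pool:
    result_a = pool.submit(job_a)
    result_b = pool.submit(job_b)
    print(result_a.result(), result_b.result())

spark.stop()
```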
Basically, what I need to do is integrate the CTBNCToolkit with Apache Spark, so that the toolkit can take advantage of Spark's concurrency and clustering features.
In general, I want to know whether the Apache Spark developers expose any way to integrate an arbitrary Java/Scala library so that a machine learning library can run on top of Spark's concurrency management.
So the goal is to make standalone machine learning libraries faster and concurrent.
No, that's not possible.
So what you want is for any algorithm to run on Spark. But to parallelize the work, Spark uses RDDs or Datasets, so in order to run your tasks in parallel, the algorithm would have to use these classes.
The only thing you could try is to write your own Spark program that makes use of the other library. But I'm not sure whether that's possible in your case. Besides, isn't Spark ML enough for you?
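To illustrate what "write your own Spark program that makes use of another library" looks like in practice, here is a minimal PySpark sketch. `train_one_model` is a hypothetical stand-in for whatever entry point the external library exposes; the point is that Spark only parallelizes work that is expressed over RDDs or DataFrames:

```python
# Sketch: parallelize independent calls to an external library with Spark.
# `train_one_model` is a hypothetical stand-in for a library entry point.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wrap-external-library").getOrCreate()
sc = spark.sparkContext


def train_one_model(config):
    # In a real program this would call into the external library,
    # e.g. a JVM toolkit via a Python/JVM bridge or a subprocess.
    return {"config": config, "score": sum(config.values())}


configs = [{"lr": 0.1, "depth": d} for d in range(1, 9)]

# Each element is processed independently on an executor; the library itself
# is unaware of Spark, so only embarrassingly parallel pieces benefit.
results = sc.parallelize(configs, numSlices=8).map(train_one_model).collect()
print(results)

spark.stop()
```

Note that this only parallelizes independent invocations (for example a parameter sweep); the library's internal computation remains sequential, which is exactly the limitation described above.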