Dask is a pure-Python distributed computing platform, similar to Apache Spark.
Is there a way to run and monitor Dask distributed jobs/tasks through a REST API, like Apache Livy for Apache Spark?
Not quite what you're asking for, but take a look at Prefect, which has strong integration with Dask (for task execution).
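As a minimal sketch (assuming the Prefect 1.x API and an already-running Dask scheduler; the address below is a placeholder), a Prefect flow can hand its task execution to Dask like this:

```python
from prefect import Flow, task
from prefect.executors import DaskExecutor  # Prefect 1.x; newer releases use a DaskTaskRunner instead

@task
def add(x, y):
    return x + y

with Flow("dask-demo") as flow:
    total = add(1, 2)

# Point the executor at your existing dask-scheduler; the address is a placeholder.
flow.run(executor=DaskExecutor(address="tcp://dask-scheduler:8786"))
```

Prefect then gives you an API and UI for triggering and monitoring runs, while Dask does the actual task execution.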
I want to use a KLL sketch for a streaming application. The catch is that the KLL sketch is a stateful computation and thus not idempotent. Can this be implemented in Beam or Flink, preferably in Python?
Apache DataSketches says integration efforts have started with Apache Flink and Apache Impala, and there is also interest from Apache Beam. What is the main difficulty? Most sketches use primitive types and arrays internally.
You can use state and timers to implement this in a streaming Beam pipeline.
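For illustration, here is a hedged sketch of Beam's Python state API (the per-key buffer stands in for a serialized KLL sketch, which is my assumption about how you would store it; the class names are the standard `apache_beam.transforms.userstate` ones):

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import BagStateSpec

class StatefulSketchFn(beam.DoFn):
    # Per-key state; a real KLL sketch would be serialized into state instead
    # of buffering raw values as done here for illustration.
    BUFFER = BagStateSpec('buffer', VarIntCoder())

    def process(self, element, buffer=beam.DoFn.StateParam(BUFFER)):
        key, value = element
        buffer.add(value)
        # Emit a running per-key aggregate (max as a stand-in for a quantile query).
        yield key, max(buffer.read(), default=value)

with beam.Pipeline() as p:
    (p
     | beam.Create([('a', 1), ('a', 3), ('b', 2)])
     | beam.ParDo(StatefulSketchFn())
     | beam.Map(print))
```

Timers (declared with `TimerSpec` and `@on_timer`) can then be used to flush or emit the sketch periodically rather than on every element.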
I was reading about Apache Beam and went through its various runners. But I am wondering: why should someone use Apache Beam with the Spark runner if they can use Apache Spark directly?
Because Apache Beam is a unified, portable, and extensible model for implementing batch and streaming data processing jobs that run on any execution engine. This means you can write the code once for both streaming and batch jobs, without any dependency on the execution platform.
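For example (a minimal sketch; `DirectRunner` and `SparkRunner` are the standard Beam runner names, and the Spark case would additionally need the usual runner and cluster configuration):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The pipeline code stays the same; only the runner option changes
# between local testing and a Spark cluster.
options = PipelineOptions(runner='DirectRunner')  # or runner='SparkRunner'

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.Create(['hello beam', 'hello spark'])
     | 'Split' >> beam.FlatMap(str.split)
     | 'Count' >> beam.combiners.Count.PerElement()
     | 'Print' >> beam.Map(print))
```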
Basically what I need to do is to integrate the CTBNCToolkit with Apache Spark, so this toolkit can take advantage of the concurrency and clustering features of Apache Spark.
More generally, I want to know whether Apache Spark exposes any way to integrate an arbitrary Java/Scala library so that the machine learning library can run on top of Spark's concurrency management.
So the goal is to make standalone machine learning libraries faster and concurrent.
No, that's not possible.
What you want is for an arbitrary algorithm to run on Spark. But to parallelize work, Spark uses RDDs or Datasets, so in order to run your tasks in parallel, the algorithms would have to be expressed in terms of these classes.
The only thing you could try is to write your own Spark program that makes use of the other library, but I'm not sure whether that's possible in your case. However, isn't Spark ML enough for you?
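To make that concrete, here is a hypothetical PySpark sketch of what "write your own Spark program that uses another library" usually looks like: the parallelism comes from how you partition the data, and the single-machine library is called inside a transformation (`some_library.score` is an invented placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wrap-library").getOrCreate()

def score_partition(records):
    # import some_library  # hypothetical single-machine library, loaded on each executor
    for record in records:
        # yield some_library.score(record)
        yield record  # placeholder so the sketch runs as-is

# Parallelism comes from the partitioning; each partition is processed independently.
rdd = spark.sparkContext.parallelize(range(100), numSlices=8)
results = rdd.mapPartitions(score_partition).collect()
print(len(results))
```

Note this only parallelizes embarrassingly parallel work; an algorithm whose internal state must be shared across records cannot be distributed this way without rewriting it on top of RDDs or Datasets.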
I am using Apache Spark to run some Python data code via PySpark, in Spark standalone mode with 7 nodes.
Is it possible to use the Apache Ignite RDD implementation in this setup? Does it offer any advantage?
Many thanks
Yes, you can use Ignite in any Spark deployment. Please refer to the documentation to better understand the possible advantages: https://apacheignite-fs.readme.io/docs/ignite-for-spark
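Note that the shared-RDD API (IgniteContext/IgniteRDD) is a Java/Scala API; from PySpark the usual route is Ignite's Spark DataFrame integration. A hedged sketch, assuming the ignite-spark module is on the classpath (the config path and table name are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ignite-demo").getOrCreate()

# Read an Ignite table as a Spark DataFrame; "config" points at the Ignite XML configuration.
df = (spark.read
      .format("ignite")
      .option("config", "/path/to/ignite-config.xml")
      .option("table", "PERSON")
      .load())
df.show()
```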
Apache Apex is an open-source, enterprise-grade unified stream and batch processing platform. It is used in the GE Predix platform for IoT.
What are the key differences between these 2 platforms?
Questions
From a data science perspective, how is it different from Spark?
Does Apache Apex provide functionality like Spark MLlib? If we have to build scalable ML models on Apache Apex, how do we do it and which language should we use?
Will data scientists have to learn Java to build scalable ML models? Does it have a Python API like PySpark?
Can Apache Apex be integrated with Spark, and can we use Spark MLlib on top of Apex to build ML models?
Apache Apex is an engine for processing streaming data. Others that try to achieve the same are Apache Storm and Apache Flink. The differentiating factor for Apache Apex is that it comes with built-in support for fault tolerance and scalability, and a focus on operability, which are key considerations in production use cases.
Comparing it with Spark: Apache Spark is actually a batch processing engine. If you consider Spark Streaming (which uses Spark underneath), then it is micro-batch processing. In contrast, Apache Apex is true stream processing, in the sense that an incoming record does NOT have to wait for the next record before being processed; a record is processed and sent to the next level of processing as soon as it arrives.
Currently, work is in progress on integrating Apache Apex with machine learning libraries like Apache SAMOA and H2O.
Refer to https://issues.apache.org/jira/browse/SAMOA-49
Currently, it has support for Java and Scala.
https://www.datatorrent.com/blog/blog-writing-apache-apex-application-in-scala/
For Python, you may try it using Jython, but I haven't tried it myself, so I'm not very sure about it.
Integration with Spark may not be a good idea, considering they are two different processing engines. But Apache Apex integration with machine learning libraries is in progress.
If you have any other questions or feature requests, you can post them on the Apache Apex users mailing list: https://mail-archives.apache.org/mod_mbox/incubator-apex-users/