I am new to Apache beam. So far, my understanding is, apache beam is nothing but the tool for ETL processing. Runners can be called as collection of CPU, memory and storage.
My question is, can i use two or more types of runners in single beam python code?
for example, one runner for dataflow, another for spark, third for directrunner, like this?
You can take your Beam pipeline, and submit it to be run on different runners.
You cannot make different runners work together (e.g. a pipeline that runs partially on Dataflow and partially on Spark).
Instead, you can write a pipeline that sometimes runs fully on Dataflow and sometimes runs fully on Spark.
LMK if I should clarify further.
Related
I have a pyspark batch job scheduled on YARN. There is now a requirement to put the logic of the spark job into a web service.
I really don't want there to be 2 copies of the same code, and therefore would like to somehow reuse the spark code inside the service, only replacing the IO parts.
The expected size of the workloads per request is small so I don't want to complicate the service by turning it into a distributed application. I would like instead to run the spark code in local mode inside the service. How do I do that? Is that even a good idea? Are there better alternatives?
A peer of mine has created code that opens a restful api web service within an interactive spark job. The intent of our company is to use his code as a means of extracting data from various datasources. He can get it to work on his machine with a local instance of spark. He insists that this is a good idea and it is my job as DevOps to implement it with Azure Databricks.
As I understand it interactive jobs are for one-time analytics inquiries and for the development of non-interactive jobs to be run solely as ETL/ELT work between data sources. There is of course the added problem of determining the endpoint for the service binding within the spark cluster.
But I'm new to spark and I have scarcely delved into the mountain of documentation that exists for all the implementations of spark. Is what he's trying to do a good idea? Is it even possible?
The web-service would need to act as a Spark Driver. Just like you'd run spark-shell, run some commands , and then use collect() methods to bring all data to be shown in the local environment, that all runs in a singular JVM environment. It would submit executors to a remote Spark cluster, then bring the data back over the network. Apache Livy is one existing implementation for a REST Spark submission server.
It can be done, but depending on the process, it would be very asynchronous, and it is not suggested for large datasets, which Spark is meant for. Depending on the data that you need (e.g. highly using SparkSQL), it'd be better to query a database directly.
When writing an Application in Scala using Spark, when ran, is it a regular Scala application which "delegates the spark jobs to the spark cluster" and gets the desired results back ?
Or does it get completely compiled to something special consumed by a "spark engine" ?
It depends on the "deploy mode"
If you use local mode, then all actions occur locally, and you don't get any benefits from distribution that Spark is meant for. While it can be used to abstract different libraries and provide clean ways to process data via dataframes or ML, it's not really intended to be used like that
Instead, you can use cluster mode, in which your app just defines the steps to take, and then when submitted, everything happens in the remote cluster. In order to process data back in the driver, you need to use methods such as collect(), or otherwise download the results from remote file systems/databases
My workflow consists of several tasks(Sequential and parallel) ranging from collecting data from Hbase and performing various machine learning algorithms on those data etc.
Is it possible to execute them in Apache Spark without using workflow manager? The reason I ask is I have an algorithm to order the tasks in batches (Tasks that can be run together). Can I submit them directly to Spark?
You might be looking for the Spark jobs scheduling within an application: https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application .
Following the configuration options mentioned above you can submit jobs (trigger jobs execution calling Spark actions) in parallel. You can apply here your algorithm to order the tasks in batches as well.
Keep in mind that some of your jobs can be dependent on the results of another ones running in parallel. Be sure to control the order of such jobs within your code (Spark doesn't do such kind of DAGs for you).
Firstly, I need to admit that I am new to Bluemix and Spark. I just want to try out my hands with Bluemix Spark service.
I want to perform a batch operation over, say, a billion records in a text file, then I want to process these records with my own set of Java APIs.
This is where I want to use the Spark service to enable faster processing of the dataset.
Here are my questions:
Can I call Java code from Python? As I understand it, presently only Python boilerplate is supported? There are few a pieces of JNI as well beneath my Java API.
Can I perform the batch operation with the Bluemix Spark service or it is just for interactive purposes?
Can I create something like a pipeline (output of one stage goes to another) with Bluemix, do I need to code for it ?
I will appreciate any and all help coming my way with respect to above queries.
Look forward to some expert advice here.
Thanks.
The IBM Analytics for Apache Spark sevices is now available and it allow you to submit a java code/batch program with spark-submit along with notebook interface for both python/scala.
Earlier, the beta code was limited to notebook interactive interface.
Regards
Anup