Is it possible to use an apache-ignite rdd implementation in pyspark? - apache-spark

I am using Apache Spark to run some Python data-processing code via pyspark. I am running in Spark standalone mode with 7 nodes.
Is it possible to use the apache-ignite RDD implementation in this setup? Does it offer any advantage?
Many thanks

Yes, you can use Ignite in any Spark deployment. Please refer to the documentation to better understand the possible advantages: https://apacheignite-fs.readme.io/docs/ignite-for-spark
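For pyspark specifically, the IgniteRDD API is exposed through Scala/Java, so from Python the usual route is the DataFrame-based integration that recent Ignite versions ship. A rough sketch, assuming the ignite-spark integration JARs are on the Spark classpath; the config path and table name below are placeholders:

# Sketch: reading an Ignite table from pyspark through the Ignite DataFrame source.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ignite-from-pyspark")
    .getOrCreate()
)

persons = (
    spark.read
    .format("ignite")                                          # Ignite data source
    .option("config", "/opt/ignite/config/ignite-config.xml")  # placeholder Ignite config
    .option("table", "PERSON")                                 # placeholder SQL table
    .load()
)

persons.filter(persons["city"] == "London").show()

The main advantage described in the linked docs is that the data lives in Ignite's shared in-memory store, so it can be reused across Spark jobs and applications instead of being reloaded each time.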

Related

Alternative to Apache Livy for Dask distributed

Dask is a pure Python-based distributed computing platform, similar to Apache Spark.
Is there a way to run and monitor Dask distributed jobs/tasks through a REST API, like Apache Livy for Apache Spark?
Not quite what you asked, but take a look at Prefect, which has strong integration with Dask (for task execution).
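Not a REST API either, but for reference, here is a minimal sketch of driving a Dask distributed cluster programmatically with the Client; the scheduler address is a placeholder:

# Sketch: submitting a task to an existing Dask distributed cluster.
from dask.distributed import Client

def square(x):
    return x * x

client = Client("tcp://dask-scheduler:8786")  # placeholder scheduler address
future = client.submit(square, 21)            # run the task on the cluster
print(future.result())                        # 441
client.close()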

What exactly is the need for Spark when using Talend?

I am new to both Spark and Talend.
But I read everywhere that both of these are ETL tools. I read another Stack Overflow answer here. From that answer, what I understood is that Talend does use Spark for large data processing. But can Talend do all the ETL work that Spark does efficiently, without using Spark under the hood? Or is it essentially a wrapper over Spark, where all the data sent to Talend is actually handed to the Spark engine inside Talend for processing?
I am quite confused by this. Can someone clarify?
Unlike Informatica BDM, which has its own Blaze framework for processing on Hadoop (native), Talend relies on other frameworks such as MapReduce (on Hadoop, possibly with Tez underneath) or the Spark engine. So you could avoid Spark, but there is little point in doing so. The key point is that you can expect some productivity gains from Talend, as it is graphically based, which is handy when there are many fields and you do not necessarily need the most skilled staff.
For NoSQL stores like HBase, they provide specific connectors, or you could go the Phoenix route. Talend also has connectors for Kafka.
Spark is just one of the frameworks supported by Talend. When you create a new job, you can pick Spark from the dropdown list. You can get more details in the docs.

Suggestions for using Spark in AWS without any cluster?

I want to leverage a couple of Spark APIs to convert some data in an EC2 container. We deploy these containers on Kubernetes.
I'm not super familiar with Spark, but I see there are requirements on Spark context, etc. Is it possible for me to just leverage the Spark APIs/RDDs, etc. without needing any cluster? I just have a simple script I want to run that leverages Spark. I was thinking I could somehow fat-jar this dependency or something, but I'm not quite sure what I'm looking for.
Yes, you need a cluster to run Spark, but a cluster is nothing more than a platform on which Spark is installed.
I think your question should really be "Can Spark run on a single/standalone node?"
If that is what you want to know, then yes, Spark can run on a single node, as Spark ships with its own standalone cluster manager.
"but I see there are requirements on Spark context":
SparkContext is the entry point of a Spark application; you need to create it in order to use any Spark function.
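A minimal sketch of what that looks like in pyspark on a single machine, using the local master so no external cluster manager is needed (the app name and data are only illustrative):

# Sketch: Spark on a single node; local[*] uses all cores of this machine.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("single-node-job")   # placeholder name
    .getOrCreate()
)
sc = spark.sparkContext           # the SparkContext mentioned above

rdd = sc.parallelize(range(10))
print(rdd.map(lambda x: x * 2).sum())  # 90

spark.stop()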

Does filtering an IgniteRDD happen locally in the Spark application or on the Ignite server?

If I execute a filter on an IgniteRDD, is the filter pushed down to the Ignite server, or does the Spark RDD first collect all the data and then execute the filter within the Spark application?
There is no collect at all, but as far as I know there is a distinction between two cases:
A plain filter will use standard Spark execution.
A call to sql will be processed by Ignite itself, without involving Spark.
Beyond that it depends on the Catalyst optimizer. You can check the query plans to understand your pipeline and see where it is executed. Debugging might also help.
As explained here, IgniteRDD is an implementation of a Spark RDD that represents an Ignite cache and exposes the Spark API. As the example there shows, a filter would operate on the cache directly.
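To see where a given filter runs, you can inspect the query plan as suggested above. A rough sketch using the DataFrame-based Ignite integration from pyspark (the IgniteRDD API itself is Scala/Java); the config path, table, and column names are placeholders:

# Sketch: checking whether a predicate is pushed down to Ignite or executed in Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ignite-plan-check").getOrCreate()

df = (
    spark.read
    .format("ignite")
    .option("config", "/opt/ignite/config/ignite-config.xml")  # placeholder config
    .option("table", "PERSON")                                 # placeholder table
    .load()
)

# explain(True) prints the logical and physical plans; a pushed-down predicate
# shows up in the data source scan node, otherwise the Filter runs in Spark.
df.filter(df["age"] > 30).explain(True)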

Is it possible to run Shark queries over Spark Streaming data?

Is it possible to run Shark queries over the data contained in the DStreams of a Spark Streaming application (for instance inside a foreachRDD call)?
Are there any specific APIs to do that?
Thanks.
To answer my own question, in case someone is facing the same problem:
the direct answer is NO, you cannot run Shark directly on Spark Streaming data.
Spark SQL is currently a valid alternative; at least it was for my needs.
It is included in Spark and doesn't require extra configuration; you can have a look at it here: http://spark.apache.org/docs/latest/sql-programming-guide.html
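For example, a minimal sketch of that pattern in pyspark: turn each batch RDD into a DataFrame inside foreachRDD and query it with Spark SQL (the socket source and the word-count query are only illustrative):

# Sketch: running Spark SQL over DStream batches via foreachRDD.
from pyspark.sql import SparkSession, Row
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("sql-over-streaming").getOrCreate()
ssc = StreamingContext(spark.sparkContext, batchDuration=10)

lines = ssc.socketTextStream("localhost", 9999)      # placeholder source
words = lines.flatMap(lambda line: line.split(" "))

def process(time, rdd):
    if rdd.isEmpty():
        return
    # Convert the batch RDD to a DataFrame and query it with SQL.
    df = spark.createDataFrame(rdd.map(lambda w: Row(word=w)))
    df.createOrReplaceTempView("words")
    spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()

words.foreachRDD(process)

ssc.start()
ssc.awaitTermination()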
