What is exactly the need of spark when using talend? - apache-spark

I am new to both spark and talend.
But I read everywhere that both of these are ETL tools. I read another stackoverflow answer here. From the other answer what I understood is talend do use spark for large data processing. But can talend do all the ETL work efficiently that spark is doing without using spark under the hood? Or is it essentially a wrapper over spark where all the data is send to talend is actually put inside the spark inside talend for processing?
I am quite confused with this. Can someone clarify this?

Unlike Informatica BDM which has its own Blaze framework for processing on Hadoop (native), Talend relies on other frameworks such as Map Reduce (Hadoop using possibly tez underneath) or Spark engine. So you could avoid Spark, but there is less point in doing so. The key point is that we could expect I think some productivity using Talend as it is graphical based, which is handy when there are many fields and you do not need possibly the most skilled staff.
For NOSQL, like HBase, they provide specific connectors or could use the Phoenix route. Talend also has connectors for KAFKA.

Spark is just one of the frameworks supported by Talend. When you create a new job, you can pick Spark from the dropdown list. You can get more details in the docs.

Related

Spark as execution engine or spark as an application?

Which option is better to use, spark as an execution engine on hive or accessing hive tables using spark SQL? And Why?
A few assumptions here are:
Reason to opt for SQL is to stay user friendly, e.g. if you have business users trying to access data.
Hive is in consideration because it provides an SQL like interface and persistence of data
If that is true, Spark-SQL is perhaps the better way forward. It is better integrated within Spark and as an integral part of Spark, it will provide more features (one example is structured streaming). You will still get user friendliness and an SQL like interface to Spark so you will get full benefits. But you will need to manage your system only from Spark's point of view. Hive installation and management will still be there but from a single perspective.
Using Hive with Spark as execution engine will keep you limited based upon how good a translation Hive's libraries are able to do to convert your HQL to Spark. They may do a pretty good job but you will still loose the advanced features of Spark SQL. And new features may take longer to get integrated in Hive compared to Spark SQL.
Also, with Hive exposed to end users, some advanced users or data engineering teams may want access to Spark. This will cause you to manage two tools. System management may get more tedious compared to only using Spark-SQL in this scenario as Spark SQL has the potential to serve both non-technical and advanced users and even if advanced users use pyspark, spark-shell or more, they will still be integrated within the same toolset.

Spark goodness with Cassandra?

I've been reading about Apache Cassandra lately to learn how it works and how to use it for IoT projects, especially in the need of time series based database..
However, I started to notice that Apache Spark is often mentioned when people talk about Cassandra too.
The question is, as long as I can use Cassandra cluster of nodes to serve my app, to store and read data, why would I need Apache Spark? any useful use-cases are appreciated!
The answer is broad but summarizing ... Cassandra is highly scalable and there are lot of scenarios where it fits but CQL sintax has some limitations if you don't have your schema ready for some queries.
If you want to make use of your data without restrictions and doing analytical workloads with your cassandra data or join with other tables Spark is the most appropriate complement. Spark has a tight integration with Cassandra.
I recommend you to check this slides: http://www.slideshare.net/patrickmcfadin/apache-cassandra-and-spark-you-got-the-the-lighter-lets-start-the-fire?qid=48e2528c-a03c-49b4-879e-45599b2aff34&v=&b=&from_search=5
Cassandra is for storing data where as Spark is for performing some computation on top of it. Analogy with Hadoop: Cassandra is like HDFS where as Spark is like Map Reduce.
Especially with computations, when using DataStax Cassandra connector, data locality can be exploited. If you need to do some computation which modifies a row (but doesn't really depend on anything else), then that operation is optimized to run locally on each machine in cluster without any data movement in network.
Same goes with a lot of other Spark workload, the actions(some function which modifies the data) are done locally and only result is sent to client. As far as I know, when you want to do analytics on top of data stored in Cassandra, Spark is well supported and popular choice. If you don't need to do any operations on the data, still you can use Spark for other purposes like I mentioned below.
Spark streaming can be used to ingest or export data from Cassandra ( I used it a lot personally). The same data import/export can be achieved with small hand-written JDBC agents but Spark streaming code I wrote for ingesting 10GB data from Cassandra contains less than 20 lines of code with multi machine-multi threading built-in and an admin UI where I can see the job progress.
With Spark+Zeppelin, we can visualize Cassandra data using Spark, we can build beautiful UIs with little Spark code where users can even enter input and see the result as graph/table etc.
Note: Actually, visualization can be better with Kibana/ElasticSearch or Solr/Banana when used with Cassandra but they are very hard to setup and indexing has it's own issues to deal with.
There are a lot of other use cases, but personally I used Spark as a Swiss army knife for multiple tasks.
Apache cassandra is have feature like fast read and write so you can use it with the apache spark streaming to write your data directly into cassandra without legacy.
For use case you can consider any video application to upload video with the help of streaming and directly store it into cassandra blob.

How to share data from Spark RDD between two applications

What is the best way to share spark RDD data between two spark jobs.
I have a case where job 1: Spark Sliding window Streaming App, will be consuming data at regular intervals and creating RDD. This we do not want to persist to storage.
Job 2: Query job that will access the same RDD created in job 1 and generate reports.
I have seen few queries where they were suggesting SPARK Job Server, but as it is a open source not sure if it a possible solution, but any pointers will be of great help.
thankyou !
The short answer is you can't share RDD's between jobs. The only way you can share data is to write that data to HDFS and then pull it within the other job. If speed is an issue and you want to maintain a constant stream of data you can use HBase which will allow for very fast access and processing from the second job.
To get a better idea you should look here:
Serializing RDD
You can share RDDs across different applications using Apache Ignite.
Apache ignite provides an abstraction to share the RDDs through which applications can access the RDDs corresponding to different applications. In addition Ignite has the support for SQL indexes, where as native Spark doesn't.
Please refer https://ignite.apache.org/features/igniterdd.html for more details.
According to the official document describes:
Note that none of the modes currently provide memory sharing across applications. If you would like to share data this way, we recommend running a single server application that can serve multiple requests by querying the same RDDs.
http://spark.apache.org/docs/latest/job-scheduling.html
You can save to a temporary view. Table will be available to other sessions until the one that creates it is closed

Bluemix Spark Service

Firstly, I need to admit that I am new to Bluemix and Spark. I just want to try out my hands with Bluemix Spark service.
I want to perform a batch operation over, say, a billion records in a text file, then I want to process these records with my own set of Java APIs.
This is where I want to use the Spark service to enable faster processing of the dataset.
Here are my questions:
Can I call Java code from Python? As I understand it, presently only Python boilerplate is supported? There are few a pieces of JNI as well beneath my Java API.
Can I perform the batch operation with the Bluemix Spark service or it is just for interactive purposes?
Can I create something like a pipeline (output of one stage goes to another) with Bluemix, do I need to code for it ?
I will appreciate any and all help coming my way with respect to above queries.
Look forward to some expert advice here.
Thanks.
The IBM Analytics for Apache Spark sevices is now available and it allow you to submit a java code/batch program with spark-submit along with notebook interface for both python/scala.
Earlier, the beta code was limited to notebook interactive interface.
Regards
Anup

Data analytics on Cassandra

We are using Apache Cassandra to save data into. Except the spark what are the tools/technologies to perform the data analytics after reading data from cassandra. Spark is good but it needs a programmer(java/scala/python) to add/modify the future requirements which leads to high maintenance cost. What are the other alternatives?
If you want to go with Spark on top of Cassandra, many have accomplished good results with Cassandra, Hive, and Hadoop. Others have accomplished similar results using a mix of Cassandra, Hive, and Solr.
Another decent set of slides and tutorial for running analysis of data via Cassandra and Hadoop. You will find more in depth explanation of this via the PDF download on the provided page.
If you're interested in continuing to pursue Spark, you can evaluate DataStax Enterprise, which took the complexity out of it and allows you to run Spark right on top of Cassandra.
To answer your question, you have a few industry proven options... Primarily Hadoop and Hive.

Resources