Where can I find documentation for the Spark Data Source API? - apache-spark

Is there any official documentation for the Spark Data Source API? I could only find sample/example implementation information from Databricks tutorials.

There is no official documentation on how to create your own custom data source with Spark, because it is part of the Spark developer API. Still, there are some good blog posts that may be helpful. I mention a few here:
http://sparkdatasourceapi.blogspot.nl/2016/10/spark-data-source-api-write-custom.html
https://michalsenkyr.github.io/2017/02/spark-sql_datasource
There is also some example code here, please check below:
https://github.com/VishvendraRana/spark-custom-datasource
And if you want to look at a real project that uses the Spark Data Source API, check Apache CarbonData:
https://github.com/apache/carbondata
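Just to give a flavour of what those posts walk through, below is a minimal, hypothetical sketch of a read-only custom data source built on the older sources API (RelationProvider/BaseRelation); the package and class names are made up, and a real source would add options handling, column pruning/filter pushdown and write support.

```scala
// Minimal sketch of a read-only data source (Spark "sources" API, pre-DataSourceV2).
// Package and class names are illustrative only.
package com.example.datasource

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Spark looks for a class called DefaultSource in the package passed to .format(...)
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new ExampleRelation(sqlContext)
}

// A relation exposes a schema and, via TableScan, produces the rows
class ExampleRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("value", StringType)))

  // A real data source would read from the external system here
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("a"), Row("b")))
}
```

You would then load it by pointing format() at the package, e.g. spark.read.format("com.example.datasource").load(); Spark picks up the DefaultSource class inside that package.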

Related

What exactly is the need for Spark when using Talend?

I am new to both Spark and Talend.
But I read everywhere that both of these are ETL tools, and I read another Stack Overflow answer here. From that answer, what I understood is that Talend does use Spark for large data processing. But can Talend do all the ETL work as efficiently as Spark without using Spark under the hood? Or is it essentially a wrapper over Spark, where all the data sent to Talend is actually processed by the Spark engine inside Talend?
I am quite confused by this. Can someone clarify?
Unlike Informatica BDM, which has its own Blaze framework for native processing on Hadoop, Talend relies on other frameworks such as MapReduce (on Hadoop, possibly with Tez underneath) or the Spark engine. So you could avoid Spark, but there is little point in doing so. The key point is that you can expect some productivity gains from using Talend, as it is graphical, which is handy when there are many fields and you do not necessarily have the most skilled staff.
For NoSQL stores like HBase, Talend provides specific connectors, or you can go the Phoenix route. Talend also has connectors for Kafka.
Spark is just one of the frameworks supported by Talend. When you create a new job, you can pick Spark from the dropdown list. You can get more details in the docs.

Hazelcast Jet connector for Apache Pulsar as source

I could not find a Hazelcast Jet source connector for Apache Pulsar. Has anybody tried this? I would appreciate any directions, pointers, sources, or considerations in case I have to write a custom stream connector for Pulsar as a source for Jet.
The initial version of the Jet connector for Apache Pulsar was recently implemented here. It hasn't been extensively tested yet. For details, you can look at the design document, in which the connector's capabilities and deficiencies are stated, and at the tutorial. If there is anything confusing about these, you can ask again.
Hazelcast Jet doesn't have any connector for Apache Pulsar as of now (version 4.0). If you'd like to contribute one, you can have a look at the Source Builder class and its section in the reference manual as a starting point.
Also, please check out the existing implementations of various connectors in the Hazelcast Jet extension modules repository, which use the source builder API, and contribute yours there.
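For a rough idea only (this is not the connector linked above), a source-builder-based Pulsar source could look something like the sketch below; the service URL, topic and subscription names are placeholders, and schema handling, batch receives and fault tolerance are all ignored.

```scala
import java.nio.charset.StandardCharsets
import java.util.concurrent.TimeUnit

import com.hazelcast.jet.core.Processor
import com.hazelcast.jet.pipeline.{SourceBuilder, StreamSource}
import org.apache.pulsar.client.api.{Consumer, PulsarClient}

object PulsarSourceSketch {

  // Builds a Jet StreamSource that polls a Pulsar consumer and emits message payloads as Strings
  def pulsarSource(serviceUrl: String, topic: String, subscription: String): StreamSource[String] =
    SourceBuilder
      .stream("pulsar-source", (_: Processor.Context) => {
        // One consumer per source processor; a real connector would also close the client
        val client = PulsarClient.builder().serviceUrl(serviceUrl).build()
        client.newConsumer().topic(topic).subscriptionName(subscription).subscribe()
      })
      .fillBufferFn[String]((consumer: Consumer[Array[Byte]], buf: SourceBuilder.SourceBuffer[String]) => {
        // Poll with a short timeout so the cooperative worker thread is never blocked for long
        val msg = consumer.receive(10, TimeUnit.MILLISECONDS)
        if (msg != null) {
          buf.add(new String(msg.getData, StandardCharsets.UTF_8))
          consumer.acknowledge(msg)
        }
      })
      .destroyFn((consumer: Consumer[Array[Byte]]) => consumer.close())
      .build()
}
```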

Sending Spark Streaming metrics to OpenTSDB

How can I send metrics from my Spark Streaming job to an OpenTSDB database? I am trying to use OpenTSDB as a data source in Grafana. Can you please point me to some references where I can start?
I do see an OpenTSDB reporter here which does a similar job. How can I integrate the metrics from my Spark Streaming job to use it? Are there any easy options for doing this?
One way to send the metrics to OpenTSDB is to use its REST API. To use it, simply convert the metrics to JSON strings and then use the Apache HttpClient library to send the data (it's in Java and can therefore be used from Scala). Example code can be found on GitHub.
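As a rough illustration of that REST approach (the host, metric name and tags below are placeholders), a small helper could POST a data point to OpenTSDB's /api/put endpoint like this:

```scala
import org.apache.http.client.methods.HttpPost
import org.apache.http.entity.{ContentType, StringEntity}
import org.apache.http.impl.client.HttpClients

object OpenTsdbPutSketch {
  def main(args: Array[String]): Unit = {
    // One OpenTSDB data point; /api/put also accepts a JSON array of such objects
    val json =
      s"""{"metric": "streaming.batch.records", "timestamp": ${System.currentTimeMillis() / 1000}, "value": 123, "tags": {"app": "my-streaming-job"}}"""

    val client = HttpClients.createDefault()
    try {
      // 4242 is the default OpenTSDB HTTP port
      val post = new HttpPost("http://opentsdb-host:4242/api/put")
      post.setEntity(new StringEntity(json, ContentType.APPLICATION_JSON))
      val response = client.execute(post)
      println(response.getStatusLine)
      response.close()
    } finally {
      client.close()
    }
  }
}
```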
A more elegant solution would be to use the Spark metrics library and add a sink for the database. There has been a discussion about adding an OpenTSDB sink to the Spark metrics library; however, in the end it was not added to Spark itself. The code is available on GitHub and should be usable. Unfortunately the code targets Spark 1.4.1, but in the worst case it should still give some indication of what needs to be added.

Spark Java with Google Store

Can someone please share any examples of Spark with Java reading and writing files from Google Cloud Storage?
Thank you in advance.
Google Store (Google Cloud Storage) is backed by Google Cloud, which has a lot of Apache Spark and Hadoop API examples for Java that may be useful for you:
https://cloud.google.com/java/samples
https://cloud.google.com/hadoop/
There are also a number of Java Spark jobs that read/write files from Google Cloud Storage here:
https://github.com/GoogleCloudPlatform/spark-examples/tree/master/gaming/java/src/main/java/com/google/cloud/sparkdemo
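As a quick illustration (bucket and paths below are placeholders), reading and writing text files on Cloud Storage from Spark is just a matter of using gs:// paths, provided the GCS Hadoop connector is on the classpath; the snippet is Scala, but the same SparkSession calls exist in the Java API:

```scala
import org.apache.spark.sql.SparkSession

object GcsReadWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gcs-example").getOrCreate()

    // Read text files directly with the gs:// scheme (bucket/paths are placeholders)
    val lines = spark.read.textFile("gs://my-bucket/input/*.txt")

    // ...apply whatever transformations you need...

    // Write the result back to another Cloud Storage path
    lines.write.text("gs://my-bucket/output/")

    spark.stop()
  }
}
```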

Not able to find class HBaseContext in the Spark package

I am trying to use "HBaseContext" with Spark but am not able to find any details; all the details come up blank:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/spark/example/hbasecontext/
I am trying to implement some of the methods explained here:
http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/
Can anyone who has implemented any of these help?
Although HBaseContext is already implemented and documented in the HBase Reference Guide, the community has not released it yet. As you can see from the HBaseContext commit history, the community has only recently been working on it again (there was no update to the SparkOnHBase project for a long time), and the hbase-spark module is not included in any downloadable version of HBase.
This is a big source of confusion for beginners; I hope the community can improve it. To access HBase from a Spark RDD, you can treat it as a normal Hadoop data source: HBase provides TableInputFormat and TableOutputFormat for this purpose.
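As a rough sketch of that TableInputFormat route (the table name below is a placeholder, and HBase connection settings are assumed to come from hbase-site.xml on the classpath), you can read an HBase table into an RDD of (row key, Result) pairs like this:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object HBaseScanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-scan").getOrCreate()

    // Picks up the ZooKeeper quorum etc. from hbase-site.xml on the classpath
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table")

    // Each record is a (row key, Result) pair
    val rdd = spark.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Print the first few row keys as a sanity check
    rdd.map { case (key, _) => Bytes.toString(key.get()) }
       .take(10)
       .foreach(println)

    spark.stop()
  }
}
```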
