not able to find class in spark package HBaseContext - apache-spark

I am trying to use "HBaseContext" through spark but not able fine any details all the details coming blank
[https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/spark/example/hbasecontext/][1]
I am trying to implement some of the methods explained here:
http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/
Can anyone who has implemented any of these help?

Although HBaseContext is already implemented and documented in the HBase Reference Guide, the author/the community has not released it yet. You can see from the HBaseContext commit history that the community is still actively working on it (and there has been no update to the SparkOnHBase project for a long time), and for any downloadable version of HBase, the hbase-spark module is not included at all.
This is a big source of confusion for beginners, and hopefully the community can improve it. To access HBase from a Spark RDD, you can treat it as a normal Hadoop data source; HBase provides TableInputFormat and TableOutputFormat for this purpose.
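For example, a rough sketch of reading an HBase table into a Spark RDD through TableInputFormat could look like the following (the table name "my_table" is just a placeholder, and the connection settings are assumed to come from an hbase-site.xml on the classpath):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseReadExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-read"))

    // Point TableInputFormat at the (placeholder) table to scan
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

    // The table comes back as an RDD of (rowkey, Result) pairs
    val rdd = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"row count: ${rdd.count()}")
    sc.stop()
  }
}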

Related

Reading CSV files with Spark sometimes runs forever

I'm using Spark 2.4.8 with the gcs-connector from com.google.cloud.bigdataoss in version hadoop2-2.1.8. For development I'm using a Compute Engine VM with my IDE. I try to consume some CSV files from a GCS bucket natively with Spark's .csv(...).load(...) functionality. Some files load successfully, but some do not; in the Spark UI I can see that the load job runs forever until a timeout fires.
The weird thing is that when I run the same application packaged as a fat JAR on a Dataproc cluster, all of the same files can be consumed successfully.
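Here is roughly what the loading code looks like (a minimal sketch; the bucket path and options are placeholders, and it assumes the GCS connector is on the classpath with authentication already configured):

import org.apache.spark.sql.SparkSession

object GcsCsvRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gcs-csv-read").getOrCreate()

    // Placeholder bucket and path; the CSV is read directly from GCS
    val df = spark.read
      .format("csv")
      .option("header", "true")
      .load("gs://my-bucket/data/input.csv")

    df.show(10)
    spark.stop()
  }
}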
What am I doing wrong?
@JanOels, as you mentioned in the comment, using the gcs-connector in version hadoop2-2.2.8 will resolve this issue; the latest hadoop2 version is hadoop2-2.2.10.
For more information about the hadoop2 versions of the gcs-connector from com.google.cloud.bigdataoss, this document can be referred to.
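If the project is built with sbt, pinning the connector could look roughly like this (a sketch, assuming the standard group and artifact IDs from com.google.cloud.bigdataoss):

// build.sbt: pin the GCS connector to the version mentioned above
libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.2.8"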
Posting the answer as community wiki for the benefit of the community that might encounter this use case in the future.
Feel free to edit this answer for additional information.

What exactly is the need for Spark when using Talend?

I am new to both Spark and Talend.
But I read everywhere that both of these are ETL tools. I read another Stack Overflow answer here. From that answer, what I understood is that Talend does use Spark for large data processing. But can Talend do all the ETL work efficiently that Spark does, without using Spark under the hood? Or is it essentially a wrapper over Spark, where all the data sent to Talend is actually processed by the Spark inside Talend?
I am quite confused by this. Can someone clarify?
Unlike Informatica BDM, which has its own Blaze framework for processing on Hadoop (native), Talend relies on other frameworks such as MapReduce (Hadoop, possibly using Tez underneath) or the Spark engine. So you could avoid Spark, but there is little point in doing so. The key point is that you can expect some productivity gain from Talend because it is graphically based, which is handy when there are many fields and you may not need the most skilled staff.
For NoSQL, like HBase, they provide specific connectors, or you could go the Phoenix route. Talend also has connectors for Kafka.
Spark is just one of the frameworks supported by Talend. When you create a new job, you can pick Spark from the dropdown list. You can get more details in the docs.

Where can I find documentation for Spark Data source API?

Is there any official documentation for the Spark Data Source API? I could only find sample/example implementation information in Databricks tutorials.
There is no official documentation on how to create your own custom data source with Spark, because it is part of the Spark developer API. Still, there are some good blogs you can check that may be helpful. I mention some of them here:
http://sparkdatasourceapi.blogspot.nl/2016/10/spark-data-source-api-write-custom.html
https://michalsenkyr.github.io/2017/02/spark-sql_datasource
There is some example code here as well, please check below:
https://github.com/VishvendraRana/spark-custom-datasource
And if you want to look at a real project that uses the Spark Data Source API, check Apache CarbonData:
https://github.com/apache/carbondata
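For orientation, a minimal custom data source under the (v1) developer API boils down to a RelationProvider that returns a BaseRelation with a TableScan. A rough sketch, with a hypothetical single-column schema and an in-memory RDD standing in for a real external system:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Entry point Spark looks up when the package name is passed to .format(...)
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new DummyRelation(sqlContext)
}

class DummyRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  // Hypothetical single-column schema for illustration
  override def schema: StructType = StructType(Seq(StructField("value", IntegerType)))

  // Produce a tiny in-memory RDD instead of reading an external system
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to 5).map(Row(_))
}

Assuming these classes live in a package such as com.example.dummy, the source could then be loaded with spark.read.format("com.example.dummy").load().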

Spark with custom hive bindings

How can I build Spark with current (Hive 2.1) bindings instead of 1.2?
http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support
The page above does not mention how this works.
Does Spark work well with Hive 2.x?
I had the same question and this is what I've found so far. You can try to build Spark with the newer version of Hive:
mvn -Dhive.group=org.apache.hive -Dhive.version=2.1.0 clean package
This runs for a long time and fails in unit tests. If you skip tests, you get a bit farther but then run into compilation errors. In summary, Spark does not work well with Hive 2.x!
I also searched through the ASF Jira for Spark and Hive and haven't found any mentions of upgrading. This is the closest ticket I was able to find: https://issues.apache.org/jira/browse/SPARK-15691

In which version did HBase integrate a Spark API?

I read the documentation for Spark and HBase:
http://hbase.apache.org/book.html#spark
I can see that the latest stable version of HBase is 1.1.2, but I also see that the apidocs are on version 2.0.0-SNAPSHOT and that the Spark apidoc is empty.
I am confused: why don't the apidocs and the HBase version match?
My goal is to use Spark and HBase (bulkGet, bulkPut, etc.). How do I know in which HBase version those functions have been implemented?
If someone has complementary documentation on this, that would be awesome.
I am on hbase-0.98.13-hadoop1.
Below is the main JIRA ticket for Spark integration into HBase; the target version is 2.0.0, which is still under development. You will need to wait for the release, or build a version from source code on your own:
https://issues.apache.org/jira/browse/HBASE-13992
Within the ticket, there are several links for documentation.
If you just want to access HBase from a Spark RDD, you can treat it as a normal Hadoop data source, based on the HBase-specific TableInputFormat and TableOutputFormat.
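To complement the read path, a rough sketch of writing an RDD to HBase through TableOutputFormat might look like this (the table name "my_table" and column family "cf" are placeholders, and the sketch assumes an HBase 1.x+ client where Put.addColumn is available):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}

object HBaseWriteExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-write"))

    // Placeholder target table; connection settings come from hbase-site.xml
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    val job = Job.getInstance(hbaseConf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    // Turn each record into a (rowkey, Put) pair and write it out
    val puts = sc.parallelize(1 to 10).map { i =>
      val rowKey = Bytes.toBytes(s"row-$i")
      val put = new Put(rowKey)
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i))
      (new ImmutableBytesWritable(rowKey), put)
    }
    puts.saveAsNewAPIHadoopDataset(job.getConfiguration)
    sc.stop()
  }
}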
As of now, Spark doesn't ship with an HBase API the way it does for Hive; you have to manually put the HBase jars on Spark's classpath via the spark-defaults.conf file.
See the link below; it has complete information about how to connect to HBase:
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
