hbase-spark connector for spark2.1.0 - apache-spark

I am using the following stack:
Hadoop-2.7.7
spark-2.4.5
Hbase-2.1.0
zk-3.5.9
I want to read and write data in HBase using Spark via the spark-submit command, but I have been unable to do so.
I have successfully started all services and searched for a suitable connector, but I could not find one.
I tried to build the connector by following https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_HBase_Connector.md
But the connector build failed, so I downloaded a prebuilt connector jar from the internet and tried with that.
When I launch spark-submit with the command below, my application fails:
spark-submit --jars /home/bigdata/downloads/hbase-spark-1.0.0.jar --packages org.apache.hbase:hbase-shaded-mapreduce:2.1.0 /home/bigdata/hbasefload.py
Error:
Traceback (most recent call last):
File "/home/bigdata/hbasefload.py", line 35, in <module>
.option("hbase.zookeeper.quorum", "node2.ellicium.com:2181")\
File "/opt/spark/spark245/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 737, in save
File "/opt/spark/spark245/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark/spark245/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/spark/spark245/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o60.save.
: java.util.NoSuchElementException: key not found: catalog
When I write to HBase from spark-shell with the same jars, it executes successfully, but it fails with spark-submit.
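The `key not found: catalog` message suggests that the data source handling the write expected a `catalog` option and did not receive one. Below is a minimal sketch of building such a catalog in Python, assuming the SHC-style JSON catalog layout. The table name, column family, and column names are hypothetical, and which connector class actually ends up on the classpath is an assumption the post does not confirm.

```python
import json


def build_catalog(table, rowkey_col, columns, namespace="default"):
    """Return an SHC-style catalog as a JSON string.

    columns maps a DataFrame column name to (column family, qualifier, type).
    All names passed in below are hypothetical placeholders.
    """
    # The row key column is declared with the reserved family "rowkey".
    cols = {rowkey_col: {"cf": "rowkey", "col": "key", "type": "string"}}
    for name, (cf, qualifier, col_type) in columns.items():
        cols[name] = {"cf": cf, "col": qualifier, "type": col_type}
    return json.dumps({
        "table": {"namespace": namespace, "name": table},
        "rowkey": "key",
        "columns": cols,
    })


def write_to_hbase(df, catalog, data_source):
    """Write a DataFrame, passing the catalog as a writer option."""
    df.write.options(catalog=catalog).format(data_source).save()


# Hypothetical table with one string column in family "cf1".
catalog = build_catalog("test_table", "id", {"name": ("cf1", "name", "string")})
print(catalog)
```

With SHC, `data_source` would typically be `org.apache.spark.sql.execution.datasources.hbase`. Whether that matches the jar used here is not something the post confirms.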

Related

Unable to fix UnknownHostException while reading a csv file from a HDFS dir

My spark program is running on a server: serverA. I am running the code from pyspark terminal. Using that program, I am trying to read a csv file from another cluster set up on another server -> server: serverB, HDFS cluster: clusterB as below:
spark = SparkSession.builder.master('yarn').appName("Detector").config('spark.app.name','dummy_App').config('spark.executor.memory','2g').config('spark.executor.cores','2').config('spark.yarn.keytab','/home/testuser/testuser.keytab').config('spark.yarn.principal','krbtgt/HADOOP.NAME.COM@NAME.COM').config('spark.executor.instances','1').config('hadoop.security.authentication','kerberos').config('spark.yarn.access.hadoopFileSystems','hdfs://clusterB').config('spark.yarn.principal','testuser@NAME.COM').getOrCreate()
The file I am trying to read is on cluster: clusterB as below:
(base) testuser@hdptetl:[~] {46} $ hadoop fs -df -h
Filesystem Size Used Available Use%
hdfs://clusterB 787.3 T 554.5 T 230.7 T 70%
The keytab details (keytab path, KDC realm) that I specified in the Spark configuration are present on serverB.
When I try to load the file as:
csv_df = spark.read.format('csv').load('hdfs://botest01/test/mr/wc.txt')
The code results in UnknownHostException as below:
>>> tdf = spark.read.format('csv').load('hdfs://clusterB/test/mr/wc.txt')
20/07/15 15:40:36 WARN FileStreamSink: Error while looking for metadata directory.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 166, in load
return self._df(self._jreader.load(path))
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'java.net.UnknownHostException: clusterB'
Could anyone tell me what mistake I made here and how I can fix it?
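One common cause of `UnknownHostException` for a name like `clusterB` is that it is an HA nameservice ID rather than a resolvable hostname, and the client has no definition for it. The post does not say whether that is the case here, so the following is only a sketch under that assumption. It builds the standard HDFS HA client properties with the `spark.hadoop.` prefix; the NameNode IDs and host names are hypothetical placeholders.

```python
def ha_nameservice_conf(nameservice, namenodes):
    """Build spark.hadoop.* settings describing a remote HA nameservice.

    namenodes maps a NameNode ID (e.g. "nn1") to its "host:port" RPC address.
    """
    prefix = "spark.hadoop."
    conf = {
        prefix + "dfs.nameservices": nameservice,
        prefix + "dfs.ha.namenodes." + nameservice: ",".join(namenodes),
        prefix + "dfs.client.failover.proxy.provider." + nameservice:
            "org.apache.hadoop.hdfs.server.namenode.ha."
            "ConfiguredFailoverProxyProvider",
    }
    for nn_id, address in namenodes.items():
        key = prefix + "dfs.namenode.rpc-address.%s.%s" % (nameservice, nn_id)
        conf[key] = address
    return conf


# Hypothetical NameNode hosts; replace with the real ones for clusterB.
settings = ha_nameservice_conf(
    "clusterB",
    {"nn1": "nn1.example.com:8020", "nn2": "nn2.example.com:8020"},
)
for key, value in sorted(settings.items()):
    print(key, "=", value)
```

Each pair could then be passed to `SparkSession.builder.config(key, value)`. If the local cluster is itself HA, `dfs.nameservices` would need to list both nameservices rather than only `clusterB`.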

Pyspark in Docker based on Hortonworks 2.6.1 is throwing error with EnableHiveSupport()

I am trying to build an edge node using Docker with HDP 2.6.1. Everything is available and running except Spark support. I was able to install and run pyspark, but only when I comment out enableHiveSupport(). I have also copied hive-site.xml from Ambari to /etc/spark2/conf, and all the Spark confs match the cluster settings. But I still get this error:
17/10/27 02:35:57 WARN conf.HiveConf: HiveConf of name hive.groupby.position.alias does not exist
17/10/27 02:35:57 WARN conf.HiveConf: HiveConf of name hive.mv.files.thread does not exist
Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/shell.py", line 43, in <module>
spark = SparkSession.builder\
File "/usr/hdp/current/spark2-client/python/pyspark/sql/session.py", line 187, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
>>> spark.createDataFrame([(1,'a'), (2,'b')], ['id', 'nm'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'spark' is not defined
I have searched for this error, but all the results I find are about Windows permission problems or a missing hive-site.xml. I am building on centos:7.3.1611 and installing the following:
RUN wget http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.1.0/hdp.repo
RUN cp hdp.repo /etc/yum.repos.d
RUN yum -y install hadoop sqoop spark2_2_6_1_0_129-master spark2_2_6_1_0_129-python hive-hcatalog
The solution to the above problem is that hive-site.xml must contain only the hive.metastore.uris property and NOTHING ELSE. (Reference: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.2/bk_spark-component-guide/content/spark-config-hive.html). Once you take the other properties out, it works like a charm!
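As an illustration of trimming the file, here is a minimal sketch that keeps only the `hive.metastore.uris` property from a hive-site.xml document. The sample input and the metastore host in it are hypothetical placeholders, and the helper name is made up for this example.

```python
import xml.etree.ElementTree as ET


def keep_only_metastore_uris(xml_text):
    """Return hive-site.xml content with only hive.metastore.uris kept."""
    root = ET.fromstring(xml_text)
    # Iterate over a copy because properties are removed while looping.
    for prop in list(root.findall("property")):
        if prop.findtext("name", default="").strip() != "hive.metastore.uris":
            root.remove(prop)
    return ET.tostring(root, encoding="unicode")


# Hypothetical input; the metastore host is a placeholder.
sample = """<configuration>
  <property><name>hive.metastore.uris</name><value>thrift://metastore.example.com:9083</value></property>
  <property><name>hive.mv.files.thread</name><value>15</value></property>
</configuration>"""

trimmed = keep_only_metastore_uris(sample)
print(trimmed)
```

The result would be written to /etc/spark2/conf/hive-site.xml in place of the full copy taken from Ambari.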

Spark on EMR Yarn - EOF Error

We are running some PySpark processes on YARN. When the datasets increase in size, we get this error in the YARN log:
Traceback (most recent call last):
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 157, in manager
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 61, in worker
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/worker.py", line 136, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/home/hadoop/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 544, in read_int
raise EOFError
java.net.SocketException: Socket is closed
at java.net.Socket.shutdownOutput(Socket.java:1496)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3$$anonfun$apply$2.apply$mcV$sp(PythonRDD.scala:256)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3$$anonfun$apply$2.apply(PythonRDD.scala:256)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3$$anonfun$apply$2.apply(PythonRDD.scala:256)
at org.apache.spark.util.Utils$.tryLog(Utils.scala:1785)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:256)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
We are running on an EMR setup of 3 m3.xlarge instances, each with 4 vCPUs, 15 GiB, and 2x40 GB.
The job is executed with the following sh script:
export SPARK_HOME=/home/hadoop/spark
JARS="/home/hadoop/avro-1.7.7.jar,/home/hadoop/spark-avro-master/target/scala-2.10/spark-avro_2.10-1.0.0.jar"
$SPARK_HOME/bin/spark-submit --master yarn-cluster --py-files deploy.zip --jars $JARS main.py
where deploy.zip contains some utility methods and lambda functions
No other configuration changes were made to the cluster.
Looking at the UI, all the jobs seem to finish with a SUCCESS status. Nevertheless, we would like to get rid of this issue, or at the very least understand what is causing it.
Do you have any idea what the origin of the error might be?
Thanks!

"KeyError: 'SPARK_HOME' ", "can't load main class from JAR" in running PySpark as an Oozie workflow job

This is a continuation of my previous question here, which was seemingly resolved but led to another issue.
I am using Spark 1.4.0 on Cloudera QuickstartVM CDH-5.4.0.
When I run my PySpark script as a SparkAction in Oozie, I encounter this error in the Oozie job / container logs:
KeyError: 'SPARK_HOME'
Then I came across this solution and this one, which are actually for Spark 1.3.0, although I tried them anyway. The documentation seems to say that this issue is already fixed in Spark 1.3.2 and 1.4.0 (but here I am, encountering the same issue).
The suggested solution in the link was to set spark.yarn.appMasterEnv.SPARK_HOME and spark.executorEnv.SPARK_HOME to any value, even a path that does not point to the actual SPARK_HOME (e.g., /bogus, although I did set these to the actual SPARK_HOME).
Here's my workflow after:
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${resourceManager}</job-tracker>
<name-node>${nameNode}</name-node>
<master>local[2]</master>
<mode>client</mode>
<name>${name}</name>
<jar>${workflowRootLocal}/lib/my_pyspark_job.py</jar>
<spark-opts>--conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark spark.executorEnv.SPARK_HOME=/usr/lib/spark</spark-opts>
</spark>
This seems to solve the original problem above. However, it leads to another error, which I see in the stderr of the Oozie container log:
Error: Cannot load main class from JAR file:/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/cloudera/appcache/application_1437103727449_0011/container_1437103727449_0011_01_000001/spark.executorEnv.SPARK_HOME=/usr/lib/spark
Since I am using Python, it should not expect a main class, right? Please note from my previous related post that the Oozie job example shipped with Cloudera QuickstartVM CDH-5.4.0, which features a SparkAction written in Java, was working in my tests. The issue seems to occur only with Python.
I would greatly appreciate any help.
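The file name at the end of the "Cannot load main class from JAR" message is the second property from the spark-opts string, which has no `--conf` of its own. The sketch below is a simplified model of a command line where each `--conf` consumes exactly one following token. It is not spark-submit's actual parser, and the helper name is hypothetical.

```python
import shlex


def split_spark_opts(opts):
    """Pair each --conf with the single token after it.

    Returns (conf_values, leftover_tokens). Leftover tokens are the ones a
    launcher would have to interpret as something else, e.g. a JAR path.
    """
    tokens = shlex.split(opts)
    confs, leftover = [], []
    i = 0
    while i < len(tokens):
        if tokens[i] == "--conf" and i + 1 < len(tokens):
            confs.append(tokens[i + 1])
            i += 2
        else:
            leftover.append(tokens[i])
            i += 1
    return confs, leftover


# The spark-opts value from the workflow above.
opts = ("--conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark "
        "spark.executorEnv.SPARK_HOME=/usr/lib/spark")
confs, leftover = split_spark_opts(opts)
print(confs)
print(leftover)
```

Under this model, writing `--conf` before each property leaves no leftover token.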
Rather than setting the spark.yarn.appMasterEnv.SPARK_HOME and spark.executorEnv.SPARK_HOME variables, try adding the following lines to your Python script before creating your SparkConf():
import os
os.environ["SPARK_HOME"] = "/path/to/spark/installed/location"
Found the reference here
This helped me resolve the error you are facing, but afterwards I hit the following error:
Traceback (most recent call last):
File "/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py", line 129, in <module>
main()
File "/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py", line 60, in main
sc = SparkContext(conf=conf)
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 107, in __init__
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 155, in _do_init
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 201, in _initialize_context
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/java_gateway.py", line 701, in __call__
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package

Spark Notebook not working on HUE for EMR

OK, I've got Hue 3.8 pointing at my EMR cluster, and it's mostly working. The one thing I'm missing that I really care about at this point is the Spark notebook.
When I attempt to choose a language for a snippet, there is an error, "No usable value for lang Did not find value which can be converted into java.lang.String (error 400)", and the logs say this:
[03/Jun/2015 11:38:59 -0700] decorators ERROR error running <function create_session at 0x7fe30acd1d70>
Traceback (most recent call last):
File "/usr/local/hue/apps/spark/src/spark/decorators.py", line 77, in decorator
return func(*args, **kwargs)
File "/usr/local/hue/apps/spark/src/spark/api.py", line 44, in create_session
response['session'] = get_api(request.user, snippet).create_session(lang=snippet['type'])
File "/usr/local/hue/apps/spark/src/spark/models.py", line 284, in create_session
response = api.create_session(kind=lang)
File "/usr/local/hue/apps/spark/src/spark/job_server_api.py", line 87, in create_session
return self._root.post('sessions', data=json.dumps(kwargs), contenttype='application/json')
File "/usr/local/hue/desktop/core/src/desktop/lib/rest/resource.py", line 122, in post
return self.invoke("POST", relpath, params, data, self._make_headers(contenttype, headers))
File "/usr/local/hue/desktop/core/src/desktop/lib/rest/resource.py", line 78, in invoke
urlencode=self._urlencode)
File "/usr/local/hue/desktop/core/src/desktop/lib/rest/http_client.py", line 161, in execute
raise self._exc_class(ex)
RestException: No usable value for lang
Did not find value which can be converted into java.lang.String (error 400)
Is this a problem with the software or my config?
This might be tied to the fact that attempting to run sudo ./hue livy_server yields:
Failed to run spark-submit executable: java.io.IOException:
Cannot run program "spark-submit": error=2, No such file or directory
spark-submit does in fact exist and is on the PATH.
The spark-submit command comes from Spark; it needs to be present on the Hue machine.
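A quick way to check this from the Hue machine is to look the command up on the PATH of the process that will launch it. This is a minimal sketch using only the Python standard library; the helper name is hypothetical.

```python
import os
import shutil


def find_executable(name, path=None):
    """Return the full path of name if it is found on path, else None.

    When path is None, the current process's PATH is searched.
    """
    return shutil.which(name, path=path)


location = find_executable("spark-submit")
if location is None:
    print("spark-submit not found on PATH:", os.environ.get("PATH", ""))
else:
    print("spark-submit found at", location)
```

Running the check under the same account and launcher as the Livy server (for example through sudo) matters, because sudo may use a different PATH than an interactive shell.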
