standalone spark cluster read from gcp-bq - apache-spark

We have a local standalone spark cluster (2.3.1) trying to connect a read from GCP-BQ table.
df.show() is throwing bunch of errors.
---error--
m/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o59.showString.
: java.lang.IllegalStateException: Could not find TLS ALPN provider; no working netty-tcnative, Conscrypt, or Jetty NPN/ALPN available
Any pointers will help
command used to submit - spark-2.3.1-bin-hadoop2.7/bin/spark-submit --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.1 pyspark-bq.py
thanks
V

Related

hbase-spark connector for spark2.1.0

I am using below stack
Hadoop-2.7.7
spark-2.4.5
Hbase-2.1.0
zk-3.5.9
I want to read and write data on hbase using spark with spark-submit command. But i was unable to do so.
I have successfully started all services and also searched connectors for same but i didn't get.
I have tried to create connectors using below link https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_HBase_Connector.md
But connector build getting failed somehow i have made it possible to get connectors from internet and tried with it
when i try to launch spark submit with below command my application is failing
spark-submit --jars /home/bigdata/downloads/hbase-spark-1.0.0.jar --packages org.apache.hbase:hbase-shaded-mapreduce:2.1.0 /home/bigdata/hbasefload.py
Error:
Traceback (most recent call last):
File "/home/bigdata/hbasefload.py", line 35, in <module>
.option("hbase.zookeeper.quorum", "node2.ellicium.com:2181")\
File "/opt/spark/spark245/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 73 7, in save
File "/opt/spark/spark245/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark/spark245/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/spark/spark245/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328 , in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o60.save.
: java.util.NoSuchElementException: key not found: catalog
As i try to write on to hbase using spark-shell with above jars it successfully get executed but failing with spark-submit.

Unable to save model in Apache Spark -- Py4JJavaError

We're getting an error while trying to save a model. model.save('DT')
Py4JJavaError: An error occurred while calling o822.save.
: org.apache.spark.SparkException: Job aborted.```
Complete Error Stack --> http://dpaste.com/16Y07B9
Anything we missed here? It is creating the folder but not writing anything.
OS: Windows 10
TIA
So it turns out I was using Spark 3.0.0Preview and ran into trouble. Switched to 2.4.5 and resolved it.

Reading file from s3 in pyspark using org.apache.hadoop:hadoop-aws

Trying to read files from s3 using hadoop-aws, The command used to run code is mentioned below.
please help me resolve this and understand what I am doing wrong.
# run using command
# time spark-submit --packages org.apache.hadoop:hadoop-aws:3.2.1 connect_s3_using_keys.py
from pyspark import SparkContext, SparkConf
import ConfigParser
import pyspark
# create Spark context with Spark configuration
conf = SparkConf().setAppName("Deepak_1ST_job")
sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
hadoop_conf = sc._jsc.hadoopConfiguration()
config = ConfigParser.ConfigParser()
config.read("/home/deepak/Desktop/secure/awsCred.cnf")
accessKeyId = config.get("aws_keys", "access_key")
secretAccessKey = config.get("aws_keys", "secret_key")
hadoop_conf.set(
"fs.s3n.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs3a.access.key", accessKeyId)
hadoop_conf.set("s3a.secret.key", secretAccessKey)
sqlContext = pyspark.SQLContext(sc)
df = sqlContext.read.json("s3a://bucket_name/logs/20191117log.json")
df.show()
EDIT 1:
As I am new to pyspark I am unaware of these dependencies, also the error is not easily understandable.
getting error as
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 98, in deco
File "/home/deepak/spark/spark-3.0.0-preview-bin-hadoop3.2/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o28.json.
: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;Ljava/lang/Object;)V
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:816)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:792)
at org.apache.hadoop.fs.s3a.S3AUtils.getAWSAccessKeys(S3AUtils.java:747)
at org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.
I had the same issue with spark 3.0.0 / hadoop 3.2.
What worked for me was to replace the hadoop-aws-3.2.1.jar in spark-3.0.0-bin-hadoop3.2/jars with hadoop-aws-3.2.0.jar found here: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/3.2.0
Check your spark guava jar version. If you download spark from Amazon like me from the link (https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz) in their documentation. You can see the include guava version is guava-14.0.1.jar and their container is using guava-21.0.jar
I have reported the issue to them and they will repack their spark to include the correct version. If you interested in the bug self, here is the link. https://hadoop.apache.org/docs/r3.2.0/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#ClassNotFoundException:_org.apache.hadoop.fs.s3a.S3AFileSystem

"KeyError: 'SPARK_HOME' ", "can't load main class from JAR" in running PySpark as an Oozie workflow job

This issue is a continuation of my previous question here, which was seemingly resolved but leads to here as another issue.
I am using Spark 1.4.0 on Cloudera QuickstartVM CHD-5.4.0.
When I run my PySpark script as a SparkAction in Oozie, I encounter this error in the Oozie job / container logs:
KeyError: 'SPARK_HOME'
Then I came across this solution and this which are actually for Spark 1.3.0, although I still did try. The documentations seem to say that this issue is already fixed for Spark version 1.3.2 and 1.4.0 (but here I am, encountering the same issue).
The suggested solution in the link was that I need to set spark.yarn.appMasterEnv.SPARK_HOME and spark.executorEnv.SPARK_HOME to anything, even if it's just any path that does not point to actual SPARK_HOME (i.e., /bogus, although I did set these to actual SPARK_HOME).
Here's my workflow after:
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${resourceManager}</job-tracker>
<name-node>${nameNode}</name-node>
<master>local[2]</master>
<mode>client</mode>
<name>${name}</name>
<jar>${workflowRootLocal}/lib/my_pyspark_job.py</jar>
<spark-opts>--conf spark.yarn.appMasterEnv.SPARK_HOME=/usr/lib/spark spark.executorEnv.SPARK_HOME=/usr/lib/spark</spark-opts>
</spark>
Which seems to solve the original problem above. However, it leads to another error when I try to inspect stderr of Oozie container log:
Error: Cannot load main class from JAR file:/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/cloudera/appcache/application_1437103727449_0011/container_1437103727449_0011_01_000001/spark.executorEnv.SPARK_HOME=/usr/lib/spark
If I am using Python, it should not expect for a main class right? Please note in my previous related post that the Oozie job example shipped with Cloudera QuickstartVM CDH-5.4.0, which features a SparkAction written in Java was working in my tests. It seems that the issue is only in Python.
Appreciate greatly anyone that can help.
Rather than setting spark.yarn.appMasterEnv.SPARK_HOME and spark.executorEnv.SPARK_HOME variables, try and add the following lines of code to your python script before setting your SparkConf()
os.environ["SPARK_HOME"] = "/path/to/spark/installed/location"
Found the reference here
This helped me resolve the error you face, but I faced the following error afterwards
Traceback (most recent call last):
File "/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py", line 129, in <module>
main()
File "/usr/hdp/current/spark-client/AnalyticsJar/boxplot_outlier.py", line 60, in main
sc = SparkContext(conf=conf)
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 107, in __init__
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 155, in _do_init
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/pyspark/context.py", line 201, in _initialize_context
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/java_gateway.py", line 701, in __call__
File "/hadoop/yarn/local/filecache/1314/spark-core_2.10-1.1.0.jar/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s signer information does not match signer information of other classes in the same package

Spark standalone mode : failed on connection exception:

I am running a spark(1.2.1) standalone cluster on my virtual machine(Ubuntu 12.04). I can run the example such as als.py and pi.py successfully. But I can't run the workcount.py example because a connection error will occur.
bin/spark-submit --master spark://192.168.1.211:7077 /examples/src/main/python/wordcount.py ~/Documents/Spark_Examples/wordcount.py
The error message is as below:
15/03/13 22:26:02 INFO BlockManagerMasterActor: Registering block manager a12:45594 with 267.3 MB RAM, BlockManagerId(0, a12, 45594)
15/03/13 22:26:03 INFO Client: Retrying connect to server: a11/192.168.1.211:9000. Already tried 4 time(s).
......
Traceback (most recent call last):
File "/home/spark/spark/examples/src/main/python/wordcount.py", line 32, in <module>
.reduceByKey(add)
File "/home/spark/spark/lib/spark-assembly-1.2.1 hadoop1.0.4.jar/pyspark/rdd.py", line 1349, in reduceByKey
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1559, in combineByKey
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 1942, in _defaultReducePartitions
File "/home/spark/spark/lib/spark-assembly-1.2.1-hadoop1.0.4.jar/pyspark/rdd.py", line 297, in getNumPartitions
......
py4j.protocol.Py4JJavaError: An error occurred while calling o23.partitions.
java.lang.RuntimeException: java.net.ConnectException: Call to a11/192.168.1.211:9000 failed on connection exception: java.net.ConnectException: Connection refused
......
I didn't use Yarn or ZooKeeper. And all the virtual machines can connect to each other via ssh without password. I also set the SPARK_LOCAL_IP for master and workers.
I think that wordcount.py example is accessing hdfs to reading lines in a file (and then count the words)
Something like:
sc.textFile("hdfs://<master-hostname>:9000/path/to/whatever")
Port 9000 is usually used for hdfs.
Please be sure that this file is accessible or do not use hdfs for that example :).
I hope it helps.

Resources