spark version 3.0.1 run enableHiveSupport will rasie pyspark.sql.utils.IllegalArgumentException - apache-spark

I am using hadoop 2.10.x + hive 3.1.x + spark 3.0.1 and trying to load log file into hive by pyspark.
I followed the code from spark document to connect to hive.
warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
.builder \
.appName("Python Spark SQL Hive integration example") \
.enableHiveSupport() \
.getOrCreate()
but it always raise pyspark.sql.utils.IllegalArgumentException: <exception str() failed>.
Traceback (most recent call last):
File "log_extra.py", line 16, in <module>
.appName("Python Spark SQL Hive integration example") \
File "/usr/local/python37/lib/python3.7/site-packages/pyspark/sql/session.py", line 191, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "/usr/local/python37/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/local/python37/lib/python3.7/site-packages/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.IllegalArgumentException: <exception str() failed>
If I don't add config enableHiveSupport, this python script can run but can only connect to built-in hive.
I have put hive-site.xml in spark/conf.
Now I don't know how to connect to my hive by spark.

Related

Unable to fetch data from external Hive cluster in a Spark job running on an EMR cluster

I am executing a Spark job on an EMR cluster to connect and fetch data from tables residing in an external Hive server which is protected with Kerberos authentication. As a pre-requisite, I am downloading hive-site.xml of the remote Hive server into $SPARK_HOME/conf/ folder on the cluster. I am also copying a keytab and authenticating using kinit.
Authentication and connection to Hive metastore is getting established, and I'm able to get a list of databases with the following code snipped:
from pyspark import SparkContext, HiveContext
from pyspark.sql import SparkSession
spark_session = SparkSession.builder.appName("Hive_check")\
.config("spark.hadoop.hive.metastore.uris", "thrift://ip-xx-xx-xx-xx.ec2.internal:9083")\
.config("spark.security.credentials.hadoopfs.enabled", "true")\
.enableHiveSupport().getOrCreate()
sc = spark_session.sparkContext
sc.setSystemProperty("hive.metastore.kerberos.principal", "hive/ip-xx-xx-xx-xx.ec2.internal#MY.REALM.COM")
sc.setSystemProperty("hive.metastore.kerberos.keytab.file", "/path/to/hive.service.keytab")
sc.setSystemProperty("javax.security.auth.useSubjectCredsOnly","false")
sc.setSystemProperty("hive.metastore.execute.setugi", "true")
hc = HiveContext(sc)
db_list = hc.sql("Show databases")
db_list.show()
However, when I try to run select queries on some existing tables (can query those tables from HUE), I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/context.py", line 433, in sql
return self.sparkSession.sql(sqlQuery)
File "/usr/lib/spark/python/pyspark/sql/session.py", line 723, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/home/hadoop/anaconda3/lib/python3.7/site-packages/py4j/java_gateway.py", line 1310, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 111, in deco
return f(*a, **kw)
File "/home/hadoop/anaconda3/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o75.sql.
: org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
I tried copying core-site.xml and yarn-site.xml into $SPARK_HOME/conf/ to resolve this issue. But now, I'm unable to create spark session itself. The program gets stuck for over 15 minutes with no logs at SparkSession....getOrCreate() step.
Is there any other configuration I'm missing? What could be the possible reason of getting connected to hive metastore but not able to fetch the data from the tables?

spark.yarn.jars - py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:

I am trying to run a spark job using a spark2-submit on command. The version of the spark installed on the cluster is cloudera's spark2.1.0 and I am specifying my jars for version 2.4.0 using conf spark.yarn.jars as shown below -
spark2-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/virtualenv/path/bin/python \
--conf spark.yarn.jars=hdfs:///some/path/spark24/*\
--conf spark.yarn.maxAppAttempts=1\
--conf spark.task.cpus=2\
--executor-cores 2\
--executor-memory 4g\
--driver-memory 4g\
--archives /virtualenv/path \
--files /etc/hive/conf/hive-site.xml \
--name my_app\
test.py
This is the code I have in test.py -
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print("Spark Session created")
On running the submit command, I see messages like below -
yarn.Client: Source and destination file systems are the same. Not copying hdfs:///some/path/spark24/some.jar
And then I get this error on the line where spark session is being created -
spark = SparkSession.builder.getOrCreate()
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/sql/session.py", line 169, in getOrCreate
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 310, in getOrCreate
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 259, in _ensure_initialized
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/java_gateway.py", line 117, in launch_gateway
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 175, in java_import
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
Authentication error: unexpected command.
the py4j in the error is coming from the existing spark and not the versions in my jar. Were my spark24 jars not picked up? The same code runs ok if I remove the conf for jars but probably from the existing spark version 2.1.0. Any clues on how to fix this?
Thanks.
The problem turned out to be that python was running from the wrong place. I had to submit from correct place this way -
PYTHONPATH=./${virtualenv}/venv/lib/python3.6/site-packages/ spark2-submit

SparkFiles.get() is not able to fetch file uploaded using --files option of spark-submit

Spark-submit --files option says that the files can be accessed using SparkFiles.get('files.txt')
So I wrote a simple program
from pyspark.sql import SparkSession
from pyspark import SparkFiles
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
print("testfile path : "+SparkFiles.get('testfile.txt'))
df=spark.read.text(SparkFiles.get('testfile.txt'))
df.show()
And then run using the below command :
spark-submit --master yarn --deploy-mode client --files testfile.txt testsubmit.py
From the logs I can see thaat the "testfile.txt" has been copied to hdfs://dummyIP.com:8020/user/root/.sparkStaging/application_1581404080152_0079/testfile.txt
20/03/05 04:41:08 INFO Client: Source and destination file systems are the same. Not copying hdfs://dummyIP.com:8020/hdp/apps/2.6.5.0-292/spark2/spark2-hdp-yarn-archive.tar.gz
20/03/05 04:41:08 INFO Client: Uploading resource file:/root/sumit/test/testfile.txt -> hdfs://dummyIP.com:8020/user/root/.sparkStaging/application_1581404080152_0079/testfile.txt
20/03/05 04:41:09 INFO Client: Uploading resource file:/usr/hdp/current/spark2-client/python/lib/pyspark.zip -> hdfs://dummyIP.com:8020/user/root/.sparkStaging/application_1581404080152_0079/pyspark.zip
But SparkFiles.get('testfile.txt') is trying to fetch 'testfile.txt' from hdfs://dummyIP.com:8020/tmp/spark-f7fedc0b-c3c7-4f6e-b72c-fc0618a03deb/userFiles-c90e2d49-c153-4945-bbe2-b006221002f9/testfile.txt
testfile path : /tmp/spark-f7fedc0b-c3c7-4f6e-b72c-fc0618a03deb/userFiles-c90e2d49-c153-4945-bbe2-b006221002f9/testfile.txt
Traceback (most recent call last):
File "/root/sumit/test/testsubmit.py", line 7, in <module>
df=spark.read.text(SparkFiles.get('testfile.txt'))
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 328, in text
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://dummyIP.com:8020/tmp/spark-f7fedc0b-c3c7-4f6e-b72c-fc0618a03deb/userFiles-c90e2d49-c153-4945-bbe2-b006221002f9/testfile.txt;'
Also note that text() function will be executed on executor nodes as is mentioned the Spark Documentations. So it seems like SparkFiles.get('files.txt') is not reading from the same location where the --files is uploading it.

hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"

I'm attempting to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table.
First I create the hive table:
[biadmin#bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
Then I create a simple pyspark script:
[biadmin#bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )
I attempt to execute with:
spark-submit --master yarn-cluster test_pokes.py
However, I encounter the error:
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 683, in _ssql_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 692, in _get_hive_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 1064, in __call__
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
...
...
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
...
at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
I have seen a number of similar posts for other Hadoop distributions, but not for BigInsights on Cloud.
The solution to this error was to add the jars:
[biadmin#bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \
/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
test_pokes.py
However, I then get a different error:
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
I've added the other question here: Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster
The final solution is captured here: https://stackoverflow.com/a/41272260/1033422

PySpark HBase/Phoenix integration

I'm supposed to read Phoenix data into pyspark.
edit:
I'm using Spark HBase converters:
Here is a code snippet:
port="2181"
host="zookeperserver"
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
cmdata_conf = {"hbase.zookeeper.property.clientPort":port, "hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": "camel", "hbase.mapreduce.scan.columns": "data:a"}
sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat","org.apache.hadoop.hbase.io.ImmutableBytesWritable","org.apache.hadoop.hbase.client.Result",keyConverter=keyConv,valueConverter=valueConv,conf=cmdata_conf)
Traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 547, in newAPIHadoopRDD
jconf, batchSize)
File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: No table was provided.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:130)
Any help would be much appreciated.
Thank you!
/Tina
Using spark phoenix plugin is the recommended approach.
please find details about phoenix spark plugin here
Environment : tested with AWS EMR 5.10 , PySpark
Following are the steps
Create Table in phoenix https://phoenix.apache.org/language/
Open Phoenix shell
“/usr/lib/phoenix/bin/sqlline.py“
DROP TABLE IF EXISTS TableName;
CREATE TABLE TableName (DOMAIN VARCHAR primary key);
UPSERT INTO TableName (DOMAIN) VALUES('foo');
Download spark phoenix plugin jar
download spark phoenix plugin jar from https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-core/4.11.0-HBase-1.3
you need phoenix--HBase--client.jar , i used phoenix-4.11.0-HBase-1.3-client.jar as per my phoenix and hbase version
From your hadoop home directory, setup the following variable:
phoenix_jars=/home/user/apache-phoenix-4.11.0-HBase-1.3-bin/phoenix-4.11.0-HBase-1.3-client.jar
Start PySpark shell and add the dependency in Driver and executer classpath
pyspark --jars ${phoenix_jars} --conf spark.executor.extraClassPath=${phoenix_jars}
--Create ZooKeeper URL ,Replace with your cluster zookeeper quorum, you can check from hbase-site.xml
emrMaster = "ZooKeeper URL"
df = sqlContext.read \
.format("org.apache.phoenix.spark") \
.option("table", "TableName") \
.option("zkUrl", emrMaster) \
.load()
df.show()
df.columns
df.printSchema()
df1=df.replace(['foo'], ['foo1'], 'DOMAIN')
df1.show()
df1.write \
.format("org.apache.phoenix.spark") \
.mode("overwrite") \
.option("table", "TableName") \
.option("zkUrl", emrMaster) \
.save()
There are two ways to do this :
1) As Phoenix has a JDBC layer you can use Spark JdbcRDD to read data from Phoenix in Spark https://spark.apache.org/docs/1.3.0/api/java/org/apache/spark/rdd/JdbcRDD.html
2) Using Spark HBAse Converters:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
https://github.com/apache/spark/tree/master/examples/src/main/python

Resources