hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory" - apache-spark

I'm attempting to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table.
First I create the hive table:
[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
Then I create a simple pyspark script:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )
I attempt to execute with:
spark-submit --master yarn-cluster test_pokes.py
However, I encounter the error:
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 683, in _ssql_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/context.py", line 692, in _get_hive_ctx
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 1064, in __call__
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0476/container_e09_1477084339086_0476_02_000001/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
...
...
Caused by: javax.jdo.JDOFatalUserException: Class org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
NestedThrowables:
java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory
...
...
at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
I have seen a number of similar posts for other Hadoop distributions, but not for BigInsights on Cloud.

The solution to this error was to add the jars:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,\
/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,\
/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
test_pokes.py
However, I then get a different error:
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
I've asked a follow-up question about that error here: Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster
The final solution is captured here: https://stackoverflow.com/a/41272260/1033422
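For reference, on Spark 2.x the same job can be written against SparkSession with enableHiveSupport(), which avoids shipping the DataNucleus jars by hand. A minimal sketch, assuming the same pokes table exists (the script name is mine):

```python
# test_pokes2.py - Spark 2.x sketch of the same job (assumes the 'pokes' Hive table)
def main():
    from pyspark.sql import SparkSession  # deferred import: pyspark lives on the cluster

    spark = (SparkSession.builder
             .appName("test_pokes")
             .enableHiveSupport()  # replaces HiveContext; wires up the Hive metastore
             .getOrCreate())
    # Same query as the HiveContext version above
    print(spark.sql("select * from pokes").collect())

# call main() from your spark-submit entry point
```

hive-site.xml still needs to be visible to the driver, e.g. via --files, just as in the answers below.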

Related

spark.yarn.jars - py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:

I am trying to run a Spark job using spark2-submit. The version of Spark installed on the cluster is Cloudera's 2.1.0, and I am specifying the jars for version 2.4.0 using the spark.yarn.jars conf, as shown below -
spark2-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/virtualenv/path/bin/python \
--conf spark.yarn.jars=hdfs:///some/path/spark24/* \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.task.cpus=2 \
--executor-cores 2 \
--executor-memory 4g \
--driver-memory 4g \
--archives /virtualenv/path \
--files /etc/hive/conf/hive-site.xml \
--name my_app \
test.py
This is the code I have in test.py -
import os
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print("Spark Session created")
On running the submit command, I see messages like below -
yarn.Client: Source and destination file systems are the same. Not copying hdfs:///some/path/spark24/some.jar
And then I get this error on the line where spark session is being created -
spark = SparkSession.builder.getOrCreate()
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/sql/session.py", line 169, in getOrCreate
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 310, in getOrCreate
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 115, in __init__
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 259, in _ensure_initialized
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/java_gateway.py", line 117, in launch_gateway
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 175, in java_import
File "/opt/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 323, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
Authentication error: unexpected command.
The py4j in the error is coming from the existing Spark install, not from the versions in my jars. Were my spark24 jars not picked up? The same code runs fine if I remove the jars conf, but presumably against the existing Spark 2.1.0. Any clues on how to fix this? Thanks.
The problem turned out to be that Python was running from the wrong place. I had to point PYTHONPATH at the virtualenv's site-packages when submitting:
PYTHONPATH=./${virtualenv}/venv/lib/python3.6/site-packages/ spark2-submit
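A quick way to confirm which interpreter and PYTHONPATH the driver actually picked up is to print them from the job itself; the helper name below is my own, not part of any API:

```python
import os
import sys

def interpreter_report():
    # What the driver is really running under; compare against the
    # virtualenv shipped via --archives / spark.yarn.appMasterEnv.PYSPARK_PYTHON
    return {
        "executable": sys.executable,
        "version": "%d.%d" % sys.version_info[:2],
        "pythonpath": os.environ.get("PYTHONPATH", "(not set)"),
    }

print(interpreter_report())

# On the executors, the same check can be run through a dummy RDD, e.g.:
#   sc.parallelize([0]).map(lambda _: __import__('sys').executable).collect()
```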

SparkFiles.get() is not able to fetch file uploaded using --files option of spark-submit

The documentation for spark-submit's --files option says that the files can be accessed using SparkFiles.get('files.txt').
So I wrote a simple program
from pyspark.sql import SparkSession
from pyspark import SparkFiles
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
print("testfile path : "+SparkFiles.get('testfile.txt'))
df=spark.read.text(SparkFiles.get('testfile.txt'))
df.show()
And then ran it using the command below:
spark-submit --master yarn --deploy-mode client --files testfile.txt testsubmit.py
From the logs I can see that the "testfile.txt" has been copied to hdfs://dummyIP.com:8020/user/root/.sparkStaging/application_1581404080152_0079/testfile.txt
20/03/05 04:41:08 INFO Client: Source and destination file systems are the same. Not copying hdfs://dummyIP.com:8020/hdp/apps/2.6.5.0-292/spark2/spark2-hdp-yarn-archive.tar.gz
20/03/05 04:41:08 INFO Client: Uploading resource file:/root/sumit/test/testfile.txt -> hdfs://dummyIP.com:8020/user/root/.sparkStaging/application_1581404080152_0079/testfile.txt
20/03/05 04:41:09 INFO Client: Uploading resource file:/usr/hdp/current/spark2-client/python/lib/pyspark.zip -> hdfs://dummyIP.com:8020/user/root/.sparkStaging/application_1581404080152_0079/pyspark.zip
But SparkFiles.get('testfile.txt') is trying to fetch 'testfile.txt' from hdfs://dummyIP.com:8020/tmp/spark-f7fedc0b-c3c7-4f6e-b72c-fc0618a03deb/userFiles-c90e2d49-c153-4945-bbe2-b006221002f9/testfile.txt
testfile path : /tmp/spark-f7fedc0b-c3c7-4f6e-b72c-fc0618a03deb/userFiles-c90e2d49-c153-4945-bbe2-b006221002f9/testfile.txt
Traceback (most recent call last):
File "/root/sumit/test/testsubmit.py", line 7, in <module>
df=spark.read.text(SparkFiles.get('testfile.txt'))
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 328, in text
File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
File "/usr/hdp/current/spark2-client/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://dummyIP.com:8020/tmp/spark-f7fedc0b-c3c7-4f6e-b72c-fc0618a03deb/userFiles-c90e2d49-c153-4945-bbe2-b006221002f9/testfile.txt;'
Also note that, as mentioned in the Spark documentation, the text() function is executed on the executor nodes. So it seems SparkFiles.get('testfile.txt') is not reading from the same location where --files uploaded the file.
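One way around this mismatch is to pin the scheme explicitly, so the node-local copy distributed by --files is read instead of a path resolved against HDFS. A sketch (the helper name is mine):

```python
def as_local_uri(path):
    # SparkFiles.get() returns a plain local filesystem path, but
    # spark.read resolves bare paths against the default filesystem
    # (HDFS here), so force the local-file scheme before reading.
    if path.startswith("file://"):
        return path
    return "file://" + path

# Inside the job:
#   local = SparkFiles.get('testfile.txt')
#   df = spark.read.text(as_local_uri(local))   # reads the --files copy
print(as_local_uri("/tmp/spark-xxxx/userFiles-yyyy/testfile.txt"))
# -> file:///tmp/spark-xxxx/userFiles-yyyy/testfile.txt
```

This relies on every executor having the file locally, which --files ensures; another option is to read the .sparkStaging path on HDFS directly.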

Not able to run the Hive sql via Spark

I am trying to execute Hive SQL via Spark code, but it throws the error below. I can only select data from the Hive table.
My spark version is 1.6.1
My Hive version is 1.2.1
Command to run spark-submit:
spark-submit --master local[8] --files /srv/data/app/spark/conf/hive-site.xml test_hive.py
Code:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql import HiveContext
sc=SparkContext()
sqlContext = SQLContext(sc)
HiveContext = HiveContext(sc)
#HiveContext.setConf("yarn.timeline-service.enabled","false")
#HiveContext.sql("SET spark.sql.crossJoin.enabled=false")
HiveContext.sql("use default")
HiveContext.sql("TRUNCATE TABLE default.test_table")
HiveContext.sql("LOAD DATA LOCAL INPATH '/srv/data/data_files/*' OVERWRITE INTO TABLE default.test_table")
df = HiveContext.sql("select * from version")
for x in df.collect():
    print x
Error:
17386 [Thread-3] ERROR org.apache.spark.sql.hive.client.ClientWrapper -
======================
HIVE FAILURE OUTPUT
======================
SET spark.sql.inMemoryColumnarStorage.compressed=true
SET spark.sql.thriftServer.incrementalCollect=true
SET spark.sql.hive.convertMetastoreParquet=false
SET spark.sql.broadcastTimeout=800
SET spark.sql.hive.thriftServer.singleSession=true
SET spark.sql.inMemoryColumnarStorage.partitionPruning=true
SET spark.sql.crossJoin.enabled=true
SET hive.support.sql11.reserved.keywords=false
SET spark.sql.crossJoin.enabled=false
OK
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. ClassCastException: attempting to cast jar:file:/srv/data/OneClickProvision_1.2.2/files/app/spark/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.6.1.jar!/javax/ws/rs/ext/RuntimeDelegate.class to jar:file:/srv/data/OneClickProvision_1.2.2/files/app/spark/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.6.1.jar!/javax/ws/rs/ext/RuntimeDelegate.class
======================
END HIVE FAILURE OUTPUT
======================
Traceback (most recent call last):
File "/home/iip/hist_load.py", line 10, in <module>
HiveContext.sql("TRUNCATE TABLE default.tbl_wmt_pos_file_test")
File "/srv/data/OneClickProvision_1.2.2/files/app/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 580, in sql
File "/srv/data/OneClickProvision_1.2.2/files/app/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/srv/data/OneClickProvision_1.2.2/files/app/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
File "/srv/data/OneClickProvision_1.2.2/files/app/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o46.sql.
: org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. ClassCastException: attempting to cast jar:file:/srv/data/OneClickProvision_1.2.2/files/app/spark/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.6.1.jar!/javax/ws/rs/ext/RuntimeDelegate.class to jar:file:/srv/data/OneClickProvision_1.2.2/files/app/spark/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.6.1.jar!/javax/ws/rs/ext/RuntimeDelegate.class
I can only select data from hive table.
This is perfectly normal and expected behavior. Spark SQL is not intended to be fully compatible with HiveQL or to implement the full set of features provided by Hive. Overall, some compatibility is preserved, but it is not guaranteed to be kept in the future, as Spark SQL converges toward the SQL 2003 standard.
From the post here:
The Spark job fails with a ClassCastException because of a conflict between different versions of the same class in the YARN and Spark jars.
Set the property below on the HiveContext:
hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.setConf("yarn.timeline-service.enabled","false")
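Applied to the question's pyspark script, the workaround looks like this; a sketch, assuming the Spark 1.6 HiveContext API from the question:

```python
def main():
    from pyspark import SparkContext       # deferred import: pyspark lives on the cluster
    from pyspark.sql import HiveContext

    sc = SparkContext()
    hc = HiveContext(sc)
    # Workaround: stop Hive's DDL path from initializing the YARN timeline
    # client, which is what drags in the conflicting javax.ws.rs classes
    # from the Spark assembly jar.
    hc.setConf("yarn.timeline-service.enabled", "false")
    hc.sql("use default")
    hc.sql("TRUNCATE TABLE default.test_table")

# call main() from your spark-submit entry point
```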

Spark Hive reporting ClassNotFoundException: com.ibm.biginsights.bigsql.sync.BIEventListener

I'm attempting to run a pyspark script on BigInsights on Cloud 4.2 Enterprise that accesses a Hive table.
First I create the hive table:
[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive>
Then I create a simple pyspark script:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import HiveContext
hc = HiveContext(sc)
pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )
I attempt to execute with:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,\
/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,\
/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
--files /usr/iop/4.2.0.0/hive/conf/hive-site.xml \
test_pokes.py
However, I encounter the error:
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
Traceback (most recent call last):
File "test_pokes.py", line 8, in <module>
pokesRdd = hc.sql('select * from pokes')
File "/disk2/local/usercache/biadmin/appcache/application_1477084339086_0485/container_e09_1477084339086_0485_02_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
...
File /container_e09_1477084339086_0485_02_000001/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at
...
...
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at
...
... 27 more
Caused by: MetaException(message:Failed to instantiate listener named: com.ibm.biginsights.bigsql.sync.BIEventListener, reason: java.lang.ClassNotFoundException: com.ibm.biginsights.bigsql.sync.BIEventListener)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.getMetaStoreListeners(MetaStoreUtils.java:1478)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:481)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
... 32 more
See also previous errors related to this issue:
hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"
Spark Hive reporting pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn cluster
The solution was to use hive-site.xml from the spark-client folder:
[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
--master yarn-cluster \
--deploy-mode cluster \
--jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,\
/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,\
/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
--files /usr/iop/current/spark-client/conf/hive-site.xml \
test_pokes.py
This is captured in the docs: http://www.ibm.com/support/knowledgecenter/SSPT3X_4.2.0/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/bi_spark.html

PySpark HBase/Phoenix integration

I'm supposed to read Phoenix data into pyspark.
edit:
I'm using Spark HBase converters:
Here is a code snippet:
port="2181"
host="zookeperserver"
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
cmdata_conf = {"hbase.zookeeper.property.clientPort":port, "hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": "camel", "hbase.mapreduce.scan.columns": "data:a"}
sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv,
    valueConverter=valueConv,
    conf=cmdata_conf)
Traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 547, in newAPIHadoopRDD
jconf, batchSize)
File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.io.IOException: No table was provided.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:130)
Any help would be much appreciated.
Thank you!
/Tina
Using the Spark Phoenix plugin is the recommended approach. Please find details about the Phoenix Spark plugin here.
Environment: tested with AWS EMR 5.10, PySpark. The steps are as follows:
Create a table in Phoenix (syntax reference: https://phoenix.apache.org/language/). Open the Phoenix shell:
/usr/lib/phoenix/bin/sqlline.py
DROP TABLE IF EXISTS TableName;
CREATE TABLE TableName (DOMAIN VARCHAR primary key);
UPSERT INTO TableName (DOMAIN) VALUES('foo');
Download the Spark Phoenix plugin jar from https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-core/4.11.0-HBase-1.3
You need the phoenix-<version>-HBase-<version>-client.jar; I used phoenix-4.11.0-HBase-1.3-client.jar to match my Phoenix and HBase versions.
From your Hadoop home directory, set up the following variable:
phoenix_jars=/home/user/apache-phoenix-4.11.0-HBase-1.3-bin/phoenix-4.11.0-HBase-1.3-client.jar
Start the PySpark shell and add the dependency to the driver and executor classpaths:
pyspark --jars ${phoenix_jars} --conf spark.executor.extraClassPath=${phoenix_jars}
# Create the ZooKeeper URL; replace with your cluster's ZooKeeper quorum (you can check hbase-site.xml)
emrMaster = "ZooKeeper URL"
df = sqlContext.read \
.format("org.apache.phoenix.spark") \
.option("table", "TableName") \
.option("zkUrl", emrMaster) \
.load()
df.show()
df.columns
df.printSchema()
df1=df.replace(['foo'], ['foo1'], 'DOMAIN')
df1.show()
df1.write \
.format("org.apache.phoenix.spark") \
.mode("overwrite") \
.option("table", "TableName") \
.option("zkUrl", emrMaster) \
.save()
There are two ways to do this:
1) As Phoenix has a JDBC layer, you can use Spark's JdbcRDD to read data from Phoenix into Spark: https://spark.apache.org/docs/1.3.0/api/java/org/apache/spark/rdd/JdbcRDD.html
2) Using Spark HBase converters:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
https://github.com/apache/spark/tree/master/examples/src/main/python
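As a sketch of the JDBC route (option 1), the generic Spark JDBC source can also talk to Phoenix's driver. The hostname and table name below are taken from the question's snippet, and the Phoenix client jar is assumed to be on the driver and executor classpaths:

```python
def main():
    from pyspark import SparkContext       # deferred import: pyspark lives on the cluster
    from pyspark.sql import SQLContext

    sc = SparkContext()
    sqlContext = SQLContext(sc)
    df = (sqlContext.read
          .format("jdbc")
          .option("driver", "org.apache.phoenix.jdbc.PhoenixDriver")
          .option("url", "jdbc:phoenix:zookeperserver:2181")  # zk quorum from the question
          .option("dbtable", "CAMEL")  # Phoenix table over the HBase 'camel' table
          .load())
    df.show()

# call main() from your spark-submit entry point
```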
