Oozie spark action involving spark sql fails if enableHiveSupport() is used - apache-spark

Using AWS EMR here.
Release label:emr-6.7.0
Hadoop distribution:Amazon 3.2.1
Applications:Spark 3.2.1, JupyterHub 1.4.1, Hue 4.10.0, Livy 0.7.1, Hive 3.1.3, Oozie 5.2.1
I'm trying to run a simple oozie workflow (via Hue) consisting of a spark action (python) that tries to create a database using spark.sql("create database temp_db_analytics").
My spark session is initialized as spark = SparkSession.builder.appName('test-app').enableHiveSupport().getOrCreate()
workflow.xml
<workflow-app name="test-workflow" xmlns="uri:oozie:workflow:0.5">
<start to="spark-3853"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="spark-3853">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>client</mode>
<name></name>
<jar>test-job.py</jar>
<file>s3://mybucket/test-job.py#test-job.py</file>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>
job.properties
oozie.use.system.libpath=True
master=yarn-cluster
mode=cluster
send_email=False
dryrun=False
nameNode=hdfs://<hostname>:8020
jobTracker=<hostname>:8032
security_enabled=False
oozie.wf.application.path=${nameNode}/user/hadoop/SampleWorkflow
test-job.py
from pyspark.sql import SparkSession
if __name__=='__main__':
spark = SparkSession.builder.appName('test-app') \
.enableHiveSupport() \
.getOrCreate()
spark.sql("create database temp_db_analytics").show()
The corresponding error message when I run this using Oozie command line or Hue UI (only relevant portion shown here):
stdout
Traceback (most recent call last):
File "/mnt/yarn/filecache/299/test-job.py", line 16, in <module>
spark.sql("create database temp_db_analytics").show()
File "/mnt/yarn/usercache/hadoop/appcache/application_1669716233310_0015/container_1669716233310_0015_01_000001/python/lib/pyspark.zip/pyspark/sql/session.py", line 723, in sql
File "/mnt/yarn/usercache/hadoop/appcache/application_1669716233310_0015/container_1669716233310_0015_01_000001/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/mnt/yarn/usercache/hadoop/appcache/application_1669716233310_0015/container_1669716233310_0015_01_000001/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/mnt/yarn/usercache/hadoop/appcache/application_1669716233310_0015/container_1669716233310_0015_01_000001/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o46.sql.
: java.lang.NoClassDefFoundError: org/apache/calcite/plan/RelOptRule
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2729)
at java.lang.Class.privateGetMethodRecursive(Class.java:3076)
at java.lang.Class.getMethod0(Class.java:3046)
at java.lang.Class.getMethod(Class.java:1812)
at org.apache.spark.sql.hive.client.HiveClientImpl.getHive(HiveClientImpl.scala:205)
at org.apache.spark.sql.hive.client.HiveClientImpl.client(HiveClientImpl.scala:269)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:236)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:235)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:285)
at org.apache.spark.sql.hive.client.HiveClientImpl.databaseExists(HiveClientImpl.scala:396)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:249)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:105)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:249)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:151)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:141)
at org.apache.spark.sql.internal.SharedState.isDatabaseExistent$1(SharedState.scala:185)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:217)
at org.apache.spark.sql.internal.SharedState.globalTempViewManager(SharedState.scala:169)
at org.apache.spark.sql.hive.HiveSessionStateBuilder.$anonfun$catalog$2(HiveSessionStateBuilder.scala:53)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager$lzycompute(SessionCatalog.scala:119)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.globalTempViewManager(SessionCatalog.scala:119)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:244)
at org.apache.spark.sql.execution.command.CreateDatabaseCommand.run(ddl.scala:83)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:115)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:112)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:108)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:519)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:83)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:519)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:495)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:108)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:95)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:93)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:221)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:101)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:98)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: org.apache.calcite.plan.RelOptRule
at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:267)
at org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:256)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 74 more
Things I have tried
Running the same job without enableHiveSupport() while creating spark session. This works fine however, I'm given to understand this uses an in-memory hive metastore rather than a persistent metastore which is not the desired outcome.
Running the same job using spark-submit --master yarn --deploy-mode cluster test-job.py works fine.
Explicitly passing the hive-site.xml to the workflow (only tried this on hue) using the --files option. Same error occurs.
Providing the calcite-core JAR to the spark job (via --packages). This gives a different error java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/CreationMetadata at the same line as above. It feels like there is a chain of unfulfilled dependencies. Don't get why this is happening only via Oozie
The dependency on calcite-core seemed to be related to Hive Cost Based Optimizer (check this out). Even tried adding the property hive.cbo.enable=false to hive-site.xml. Still the same issue.
What am I missing here? Any suggestions welcome. More than happy to furnish any more info.

Related

How to resolve spark error when reading from s3

I am getting an error(java.io.IOException: No FileSystem for scheme: S3a) when running a spark application. I have looked through various other questions regarding this type of error, but Im not able to determine the solution. Spark is version 3.1.2
Updated details below to reflect current state
pyspark script:
import os
#os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.4 pyspark-shell'
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("s3reader") \
.getOrCreate()\
sc = spark.sparkContext
#sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
#sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "xxxxxxx")
#sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "xxxxxxxxxxxx")
#sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint","xxx.x.xxx.x.com", "us-1-east")
#sc._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
df = spark.read.json("S3a://silver/testfolder/4a2426b2-856c-4e9b-b698-b3dcdca74f48")
print(df)
here are my jar versions:
cloud#spark-dev-master:/usr/local/spark/jars$ ls -ltr *aws*
-rw-rw-r-- 1 cloud cloud 126287 Aug 18 2016 hadoop-aws-2.7.4.jar
-rw-rw-r-- 1 cloud cloud 4479 Sep 17 02:36 aws-java-sdk-1.7.4.jar
stack trace:
Traceback (most recent call last):
File "/home/cloud/sparks3test.py", line 18, in <module>
df = spark.read.json("S3a://silver/testfolder/4a2426b2-856c-4e9b-b698-b3dcdca74f48")
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 372, in json
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
File "/usr/local/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o33.json.
: java.io.IOException: No FileSystem for scheme: S3a
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:377)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:519)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)
You need to use hadoop-aws version 3.2.0.
You can refer my previous answer here.
I am getting an error(java.lang.NoClassDefFoundError:org/apache/hadoop/fs/StreamCapabilities)
This is what you see when you mix hadoop-aws and hadoop-common JAR versions. They must match point for point (as spark JARs also require).
Do not attempt to work around this except by syncing up JARs, you will only be moving stack traces around.
See Hadoop troubleshooting s3a
As there still appeared to be jar dependecies issues, I did a fresh install on spark using 3.1.2 and hadoop 3.2.0 and aligned hadoop-aws and java-sdk jars with aws-common jar version on the master and worker nodes. This corrected the file system issue. Consequently, upgrading to 3.2.0 also corrected the endpoint issue we were running to as well as path.style.access=true is not supported in any hadoop version older than 2.8.0. That issue was documented here: https://issues.apache.org/jira/browse/HADOOP-12963 for reference.

How to access azure block file system (abfss) from a standalone spark cluster

I have a need to use a standalone spark cluster (2.4.7) with Hadoop 3.2 and I am trying to access the ADLS Gen2 storage through pyspark.
I've added a shared key to my core-site.xml and I can ls the storage account like so:
hadoop fs -ls abfss://<container>#<storage_account>.dfs.core.windows.net/
But when I try to read a json file in pyspark (using the shell) like so:
spark.conf.set("fs.azure.account.key.<<storageaccount>>.dfs.core.windows.net", "<<key>>")
spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("abfss://<container>#<storageaccount>.dfs.core.windows.net/example.json").show()
I get the following error:
WARN streaming.FileStreamSink: Error while looking for metadata directory.
File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o103.json.
: java.io.IOException: No FileSystem for scheme: abfss
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:561)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:559)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:355)
at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:559)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:411)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)WARN: command not found
I have also configured the HADOOP lib paths for SPARK_DIST_CLASSPATH to point to $(hadoop classpath) and copied over the hadoop-azure jar to the hadoop/common folder. But still unable to access abfss via pyspark.
What could I be missing here?
Also tried answers given here
This article helps you to DIY: Apache Spark and ADLS Gen 2 support, make sure you have followed all the necessary steps to successfully configure ADLS gen2 on Apache Spark.

How do you read a file from Azure Blob w/ Apache Spark without Databricks but with wasbs on Windows 10?

I have azure-storage-8.6.0.jar and hadoop-azure-3.0.1.jar. I keep seeing from other forums that I have to modify the core-site.xml file in the etc folder in hadoop like so https://github.com/hning86/articles/blob/master/hadoopAndWasb.md. I didn't know I even needed to download all of hadoop to run Spark. I thought all I needed was the winutils.exe in hadoop/bin.
spark.read.load(f"wasbs://{container_name}#{storage_account_name}.blob.core.windows.net/{container_name}/myfile.txt" )
Py4JJavaError: An error occurred while calling o53.load.
: java.io.IOException: No FileSystem for scheme: wasbs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
If you want to use pyspark to read CSV file from Azure blob storage on Windows 10, please refer to the following steps
Install pyspark
pip install pyspark
Code (create .py file)
from pyspark.sql import SparkSession
import traceback
try:
spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set('fs.azure.account.key.<account name>.blob.core.windows.net',
'<account key>')
df = spark.read.option("header", True).csv(
'wasbs://<container name>#<account name>.blob.core.windows.net/<directory name>/<file name>')
df.show()
except Exception as exp:
print("Exception occurred")
print(traceback.format_exc())
Run code
cd <your python or env path>\Scripts
spark-submit --packages org.apache.hadoop:hadoop-azure:3.2.1,com.microsoft.azure:azure-storage:8.6.5 <your py file path>

How to read scylladb table in pyspark dataframe

I am trying to read scylladb table installed one pc into pyspark dataframe on another pc.
The 2 pcs have ssh connectivity and I am able to read the table via python code, there is problem only while connecting with spark.I have used this connector:
--packages datastax:spark-cassandra-connector:2.3.0-s_2.11 ,
My spark -version = 2.3.1 , scala-version-2.11.8.
**First Approach**
from pyspark import SparkConf
from pyspark import SparkContext
from pyspark.sql import SparkSession
conf = SparkConf().set("spark.cassandra.connection.host","192.168.0.118")
sc = SparkContext(conf = conf)
spark=SparkSession.builder.config(conf=conf).appName('FinancialRecon').getOrCreate()
sqlContext =SQLContext(sc)
data=spark.read.format("org.apache.spark.sql.cassandra").options(table="datarecon",keyspace="finrecondata").load().show()
Resulting error:
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 172, in load
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o43.load.
: java.lang.ClassNotFoundException: org.apache.spark.Logging was removed in Spark 2.0. Please check if your library is compatible with Spark 2.0
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:646)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 13 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 33 more
Another Approch that I have used is :
data=sc.read.format("org.apache.spark.sql.cassandra").options(table="datarecon",keyspace="finrecondata").load().show()
For this I get:
AttributeError: 'SparkContext' object has no attribute 'read'
Third Approach:
data=sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="datarecon",keyspace="finrecondata").load().show()
For this I get the same error as the first approach.
Please advice whether it is scylla spark connector issue or some spark library issue and how to solve it.
Follow these steps :
1.Run the spark-shell with the packages line.To configure the default Spark Configuration pass key value pairs with --conf, In my case scylla host is 172.17.0.2
bin/spark-shell --conf spark.cassandra.connection.host=172.17.0.2 --packages datastax:spark-cassandra-connector:2.3.0-s_2.11
2.Enable Cassandra-specific functions on the SparkContext, SparkSession, RDD, and DataFrame:
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
3.Load data from scylla
val rdd = sc.cassandraTable("my_keyspace", "my_table")
4.Test
scala> rdd.collect().foreach(println)
CassandraRow{id: 1, name: ash}
The resulting error occurs due to a version conflict. Maybe you can solve it reading here.
The first approach will work because read method is available on SparkSession.

Running Spark job in Oozie using Yarn-cluster

I had created a Spark Job using Oozie which configure run on yarn-cluster.
The Spark program is written in Scala and is a very simple program which just initialize SparkContext, println("hello world") and stop the SparkContext.
Below is the workflow.xml:
<workflow-app name="My_Workflow" xmlns="uri:oozie:workflow:0.5">
<start to="spark-0177"/>
<kill name="Kill">
<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="spark-0177">
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn-cluster</master>
<mode>cluster</mode>
<name>MySpark</name>
<class>com.test1</class>
<jar>/user/hue/oozie/workspaces/tl_test/lib/testOozie1.jar</jar>
<spark-opts>--executor-cores 2 --driver-memory 5g --num-executors 2 --executor-memory 5g</spark-opts>
</spark>
<ok to="End"/>
<error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>
However, i was getting the following error:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
at org.apache.hadoop.fs.Path.<init>(Path.java:135)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:191)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$3.apply(Client.scala:254)
at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$3.apply(Client.scala:248)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:248)
at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:384)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:102)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:623)
at org.apache.spark.deploy.yarn.Client$.main(Client.scala:651)
at org.apache.spark.deploy.yarn.Client.main(Client.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
at org.apache.oozie.action.hadoop.SparkMain.runSpark(SparkMain.java:105)
at org.apache.oozie.action.hadoop.SparkMain.run(SparkMain.java:96)
at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:46)
at org.apache.oozie.action.hadoop.SparkMain.main(SparkMain.java:40)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:228)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:370)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:295)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:181)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Oozie Launcher failed, finishing Hadoop job gracefully
Oozie Launcher, uploading action data to HDFS sequence file: hdfs://MYRNDSVRVM350:8020/user/oozie-oozi/0000084-150828094553499-oozie-oozi-W/spark-156b--spark/action-data.seq
Oozie Launcher ends
Please help as i'm totally stuck already.
Thanks.
I meet the exactly same wired problem. It turns out that multiple spaces in <spark-opts> cause this un-informtive error. You have two spaces between --executor-cores 2 --driver-memory 5g.
Can not create a Path from an empty string <- Spark is trying to stage your code, but your Oozie workflow doesn't provide the jar path, hence the empty path exception.
Providing the jar path in Oozie will fix this

Resources