Spark Loading Data from Azure Data Lake Store - Py4JJavaError: NoSuchMethodError - apache-spark

I am trying to load data in Spark 2.3.1 from ADLS using the following:
moviesfileAdls = "adl://xxxxxx.azuredatalakestore.net/Data/movies.csv"
dfMovies = spark.read.format("csv") \
.option("header", "true") \
.option("delimiter",",") \
.load(moviesfileAdls)
The setup: Hadoop 3.1.1 is running on the same box as spark-2.3.1-bin-hadoop2.7. On the Hadoop side, I am able to copy the file into HDFS using the following command:
hadoop distcp adl://xxxxxx.azuredatalakestore.net/Data/movies.csv /user/hadoop/movies
The above command successfully copies the file into local HDFS, so I believe the Hadoop setup is OK.
However, when I try to run the spark.read.format("csv") command, I am getting the following error:
Py4JJavaError: An error occurred while calling o54.load.
: java.lang.NoSuchMethodError: org.apache.hadoop.conf.Configuration.reloadExistingConfigurations()V
at org.apache.hadoop.fs.adl.AdlConfKeys.addDeprecatedKeys(AdlConfKeys.java:126)
at org.apache.hadoop.fs.adl.AdlFileSystem.<clinit>(AdlFileSystem.java:98)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
I tried adding the ADLS jars directly to spark-defaults.conf:
spark.jars /usr/local/hadoop/share/hadoop/tools/lib/azure-data-lake-store-sdk-2.3.1.jar, /usr/local/hadoop/share/hadoop/tools/lib/hadoop-azure-datalake-3.1.1.jar
HADOOP_CLASSPATH includes the folder where the jars are located, as seen by the spark user:
spark@xxxxx:~$ echo $HADOOP_CLASSPATH
/usr/local/hadoop/etc/hadoop/*:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/hadoop/share/hadoop/yarn/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*:/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/tools/lib/*
Any pointers are greatly appreciated.
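For context, the NoSuchMethodError on Configuration.reloadExistingConfigurations usually points at a Hadoop version mismatch: hadoop-azure-datalake-3.1.1 expects a newer hadoop-common than the 2.7 client libraries bundled with spark-2.3.1-bin-hadoop2.7. Below is a minimal sketch of what the read looks like once the ADLS jars and the hadoop-common on Spark's classpath come from the same Hadoop release; the OAuth service-principal settings are placeholders, not values from the question.
from pyspark.sql import SparkSession

# Sketch only: assumes hadoop-azure-datalake, azure-data-lake-store-sdk and
# hadoop-common on the Spark classpath all come from the same Hadoop release.
spark = SparkSession.builder.appName("adls-csv-read").getOrCreate()

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
hconf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
hconf.set("fs.adl.oauth2.client.id", "<application-id>")      # placeholder
hconf.set("fs.adl.oauth2.credential", "<client-secret>")      # placeholder
hconf.set("fs.adl.oauth2.refresh.url",
          "https://login.microsoftonline.com/<tenant-id>/oauth2/token")  # placeholder

moviesfileAdls = "adl://xxxxxx.azuredatalakestore.net/Data/movies.csv"
dfMovies = spark.read.format("csv") \
    .option("header", "true") \
    .option("delimiter", ",") \
    .load(moviesfileAdls)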

Related

How to add Spark-excel to PySpark

I'm trying to read an xlsx file into PySpark and have tried multiple ways to import the spark-excel library, but I still get errors when reading the xlsx file.
I'm using Spark in standalone mode on my Mac.
My code:
# spark configuration
import findspark
from pyspark.sql import SparkSession

spark_path = "/spark/spark-3.0.1-bin-hadoop2.7"
findspark.init(spark_path)
spark = SparkSession.builder.master("local").appName("Word Count").config("--packages com.crealytics:spark-excel_2.12:0.13.7").getOrCreate()
data_location = "bank_transactions.xlsx"
df = spark.read.format("com.crealytics.spark.excel").load(data_location)
I got the following error:
Py4JJavaError: An error occurred while calling o37.load.
: java.lang.NoClassDefFoundError: scala/Product$class
at com.crealytics.spark.excel.Utils$MapIncluding.<init>(Utils.scala:9)
at com.crealytics.spark.excel.WorkbookReader$.<init>(WorkbookReader.scala:31)
at com.crealytics.spark.excel.WorkbookReader$.<clinit>(WorkbookReader.scala)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:28)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:18)
at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:12)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:344)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:602)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
... 23 more
Solution:
1. Download the proper spark-excel library; for me it's:
https://mvnrepository.com/artifact/com.crealytics/spark-excel_2.12/0.13.7
2. Create a spark_jars directory in SPARK_HOME and store the spark-excel jar there.
3. Add the package to the Spark session, either by pointing spark.executor.extraClassPath at the spark_jars directory or by listing the coordinates in spark.jars.packages as below:
findspark.init(spark_path)
spark = SparkSession.builder.master("local") \
    .appName("Word Count") \
    .config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.7") \
    .getOrCreate()
spark
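Once the package resolves, the read itself is short; a small sketch (the header option is a standard spark-excel option, and the file name is the one from the question):
df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", "true") \
    .load("bank_transactions.xlsx")
df.show(5)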

How do you read a file from Azure Blob w/ Apache Spark without Databricks but with wasbs on Windows 10?

I have azure-storage-8.6.0.jar and hadoop-azure-3.0.1.jar. I keep seeing on other forums that I have to modify the core-site.xml file in Hadoop's etc folder, as described in https://github.com/hning86/articles/blob/master/hadoopAndWasb.md. I didn't know I even needed to download all of Hadoop to run Spark; I thought all I needed was winutils.exe in hadoop/bin.
spark.read.load(f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net/{container_name}/myfile.txt")
Py4JJavaError: An error occurred while calling o53.load.
: java.io.IOException: No FileSystem for scheme: wasbs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:366)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
If you want to use pyspark to read a CSV file from Azure Blob Storage on Windows 10, please refer to the following steps.
Install pyspark
pip install pyspark
Code (create .py file)
from pyspark.sql import SparkSession
import traceback

try:
    spark = SparkSession.builder.getOrCreate()
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    conf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    spark.conf.set('fs.azure.account.key.<account name>.blob.core.windows.net',
                   '<account key>')
    df = spark.read.option("header", True).csv(
        'wasbs://<container name>@<account name>.blob.core.windows.net/<directory name>/<file name>')
    df.show()
except Exception as exp:
    print("Exception occurred")
    print(traceback.format_exc())
Run code
cd <your python or env path>\Scripts
spark-submit --packages org.apache.hadoop:hadoop-azure:3.2.1,com.microsoft.azure:azure-storage:8.6.5 <your py file path>
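If you prefer to run the script with plain python instead of spark-submit, the same connectors can be requested from the session builder, provided no JVM has been started yet. A sketch, with the coordinates copied from the spark-submit line above:
from pyspark.sql import SparkSession

# spark.jars.packages only takes effect when set before the session (and its JVM) is created.
spark = SparkSession.builder \
    .appName("wasbs-read") \
    .config("spark.jars.packages",
            "org.apache.hadoop:hadoop-azure:3.2.1,com.microsoft.azure:azure-storage:8.6.5") \
    .getOrCreate()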

Tune Spark and YARN properties in HDP Sandbox

I'm using HDP Sandbox 3.1 and running NLTK over 50K files using the spark2 interpreter in a Zeppelin notebook.
It's a single-node setup.
I've given the guest system 12 GB of RAM and 6 CPUs.
In Spark, I'm reading all 50K files in a single RDD operation, but at 63% the process hangs and then fails with an error.
Which values in Spark and YARN do I have to set so that Spark can run at full throttle?
Edit: each file is around 3 KB.
Following is the log from when the error occurred:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1848)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1761)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1931)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1361)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
at org.apache.spark.SparkContext$$anon$3.run(SparkContext.scala:1876)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:162)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
(<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.\n', JavaObject id=o101)

Exception in Pyspark Structured Streaming while reading from Kafka

Environment: Spark 2.4.0
I have included the spark-sql-kafka-0-10 jar, and it is the same version as the Spark I am using.
Here's the exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o38.load.
: java.lang.NoClassDefFoundError: org.apache.kafka.common.serialization.ByteArrayDeserializer
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<init>(KafkaSourceProvider.scala:487)
at org.apache.spark.sql.kafka010.KafkaSourceProvider$.<clinit>(KafkaSourceProvider.scala)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.validateStreamOptions(KafkaSourceProvider.scala:414)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:66)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:209)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:95)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:95)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:33)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:171)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:90)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:508)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:812)
Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.serialization.ByteArrayDeserializer
at java.net.URLClassLoader.findClass(URLClassLoader.java:610)
at java.lang.ClassLoader.loadClassHelper(ClassLoader.java:937)
at java.lang.ClassLoader.loadClass(ClassLoader.java:882)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:343)
at java.lang.ClassLoader.loadClass(ClassLoader.java:865)
... 20 more
I didn't have the kafka-clients jar on my classpath. Adding it fixes the missing class exception.
Starting the spark-shell with the --packages option works too:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0
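For PySpark the equivalent is pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 (or the same flag on spark-submit). A short sketch of the streaming read that triggers the o38.load call above, once both jars resolve; the broker and topic names are placeholders, not from the question:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Placeholders: point these at a real broker and topic.
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "my_topic") \
    .load()

# Kafka rows carry binary key/value columns; cast value to a string for a quick console check.
query = df.selectExpr("CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()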

Getting Error when I ran hive UDF written in Java in pyspark EMR 5.x

I have a Hive UDF written in Java and I am trying to use it in PySpark 2.0.0. Below are the steps:
1. Copied the jar file to EMR.
2. Started a pyspark session like below:
pyspark --jars ip-udf-0.0.1-SNAPSHOT-jar-with-dependencies-latest.jar
3. Used the code below to access the UDF:
from pyspark.sql import SparkSession
from pyspark.sql import HiveContext
sc = spark.sparkContext
sqlContext = HiveContext(sc)
sqlContext.sql("create temporary function ip_map as 'com.mediaiq.hive.IPMappingUDF'")
I get the below error:
py4j.protocol.Py4JJavaError: An error occurred while calling o43.sql.
: java.lang.NoSuchMethodError: org.apache.hadoop.hive.conf.HiveConf.getTimeVar(Lorg/apache/hadoop/hive/conf/HiveConf$ConfVars;Ljava/util/concurrent/TimeUnit;)J
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:76)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:98)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2453)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2465)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:340)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:189)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
at org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
at org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
You may have built your UDF against a different version of Hive. Be sure to specify the same Hive version in the pom.xml used to build the jar containing the UDF. See this previous answer, for example.
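Once the jar is built against the Hive version that the EMR Spark build expects, registration and use from PySpark look roughly like this (a sketch; the table and column names are placeholders, and Hive support is assumed to be enabled on the session):
spark.sql("create temporary function ip_map as 'com.mediaiq.hive.IPMappingUDF'")
result = spark.sql("SELECT ip_map(ip_address) AS mapped_value FROM access_logs")  # placeholder table/column
result.show(10)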
