Saving a PySpark DataFrame to Azure Storage

I am migrating a proof of concept from AWS / EMR to Azure.
It's written in Python and uses Spark, Hadoop and Cassandra on AWS EMR and S3. It calculates Potential Forward Exposure for a small set of OTC derivatives.
I have one roadblock at present: How do I save a pyspark dataframe to Azure storage?
In AWS / S3 this is quite simple, however I’ve yet to make it work on Azure. I may be doing something stupid!
I've tested out writing files to blob and file storage on Azure, but have yet to find pointers to dataframes.
On AWS, I currently use the following:
npv_dataframe.coalesce(1).saveAsTextFile(output_dir + '/exposure_scenarios/' + str(counterparty))
where output_dir is in the format s3://s3_bucket_name/directory_name
I set up a Data Lake Storage Gen2 storage account and container. I have enabled public access to it.
I have explored various methods, e.g.:
https://learn.microsoft.com/en-us/python/api/overview/azure/storage-blob-readme?view=azure-python
https://learn.microsoft.com/en-us/azure/storage/common/storage-samples-python?toc=/azure/storage/blobs/toc.json
https://docs.databricks.com/_static/notebooks/data-import/azure-blob-store.html
Write data from pyspark to azure blob? (I believe this is old and that hadoop 3.2.1 comes with abfs support)
Some of these examples use a file-upload pattern but what I wanted was a direct save from a pyspark dataframe.
The test code I used was:
import traceback
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
try:
    spark = SparkSession.builder.getOrCreate()
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    conf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    spark.conf.set('fs.azure.account.key.#myaccount#.blob.core.windows.net', '#mykey#')
    df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("age")
    df.show()
    df \
        .coalesce(1) \
        .write.format('csv') \
        .option('header', True) \
        .mode('overwrite') \
        .save('wasbs://#mycontainer#@#myaccount#.blob.core.windows.net/result_csv')
    print("Hadoop version: " + spark.sparkContext._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion())
except Exception as exp:
    print("Exception occurred")
    print(traceback.format_exc())
The example above fails at the df.write; the error is:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.save.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
I receive the same error when using spark-submit:
spark-submit --packages org.apache.hadoop:hadoop-azure:3.2.1,com.microsoft.azure:azure-storage:8.6.3 ./test.py
I believe this may be a version compatibility problem. I noticed that the hadoop.jars in pyspark were all version 2.7.4, whereas I was referencing the 3.2.1 installation.
I am / was using:
Java 8 (1.8.0_265)
Spark 3.0.0
Hadoop 3.2.1
Python 3.6
Ubuntu 18.04
I ensured all hadoop jars in the Spark directory were the same as in the Hadoop jar directory.
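To sanity-check a suspicion like this, you can scan which Hadoop version(s) the jars in a Spark install actually carry. A small stdlib-only sketch; the standard $SPARK_HOME/jars layout is assumed, and the helper names are mine:

```python
import glob
import os
import re


def hadoop_jar_versions(jar_names):
    """Extract version numbers from hadoop-*.jar file names."""
    versions = set()
    for name in jar_names:
        m = re.search(r"^hadoop-.*-(\d+\.\d+\.\d+)\.jar$", name)
        if m:
            versions.add(m.group(1))
    return versions


def spark_bundled_hadoop_versions(spark_home):
    """Scan $SPARK_HOME/jars for the Hadoop version(s) Spark was built against."""
    jars = glob.glob(os.path.join(spark_home, "jars", "hadoop-*.jar"))
    return hadoop_jar_versions(os.path.basename(j) for j in jars)


# A mixed result here would point at exactly this kind of mismatch:
print(sorted(hadoop_jar_versions(["hadoop-common-2.7.4.jar",
                                  "hadoop-azure-3.2.1.jar"])))
# -> ['2.7.4', '3.2.1']
```

If this reports e.g. only 2.7.4 while the hadoop-azure package pulled in via --packages is 3.2.1, that explains NoSuchMethodError failures like the one above.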
After following another stack trace error I updated the command to: spark-submit --packages org.apache.hadoop:hadoop-azure:3.2.1,com.microsoft.azure:azure-storage:8.6.5 test.py
I then received a different Java error, which looks like a problem with the key:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.save.
: java.lang.NoSuchMethodError: 'org.apache.hadoop.conf.Configuration org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(org.apache.hadoop.conf.Configuration, java.lang.Class)'
at org.apache.hadoop.fs.azure.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:45)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.getAccountKeyFromConfiguration(AzureNativeFileSystemStore.java:989)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.createAzureStorageSession(AzureNativeFileSystemStore.java:1078)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.initialize(AzureNativeFileSystemStore.java:543)
at org.apache.hadoop.fs.azure.NativeAzureFileSystem.initialize(NativeAzureFileSystem.java:1344)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.planForWritingFileFormat(DataSource.scala:424)
at org.apache.spark.sql.execution.datasources.DataSource.planForWriting(DataSource.scala:524)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:832)
Also, after adding the Azure account secure key to the hadoop config, if I try:
hdfs dfs -ls wasbs://CONTAINER@ACCOUNT.blob.core.windows.net/
I receive the error: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure not found
Any help appreciated! Bit stuck for ideas. It also seems that, relative to AWS, there are few solved posts about Azure storage / Dataframe issues.

According to my test, we can use the package com.microsoft.azure:azure-storage:8.6.3 to upload files to Azure Blob Storage from Spark.
For example
I am using
Java 8 (1.8.0_265)
Spark 3.0.0
Hadoop 3.2.0
Python 3.6.9
Ubuntu 18.04
My code
import traceback
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
try:
    spark = SparkSession.builder.getOrCreate()
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    conf.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    spark.conf.set('fs.azure.account.key.jimtestdiag924.blob.core.windows.net', '')
    df = spark.createDataFrame(["10", "11", "13"], StringType()).toDF("age")
    df.show()
    df \
        .coalesce(1) \
        .write.format('csv') \
        .option('header', True) \
        .mode('overwrite') \
        .save('wasbs://testupload@<account name>.blob.core.windows.net/result_csv')
    print("Hadoop version: " + spark.sparkContext._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion())
except Exception as exp:
    print("Exception occurred")
    print(traceback.format_exc())
My command
spark-submit --packages org.apache.hadoop:hadoop-azure:3.2.0,com.microsoft.azure:azure-storage:8.6.3 spark.py
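For reference, the wasbs URI that save() expects puts the container before the '@' and the storage account after it. A tiny sketch of the shape, with placeholder names:

```python
def wasbs_path(container, account, blob_path):
    """Build a wasbs:// URI of the shape the hadoop-azure WASB driver expects."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{blob_path}"


# Placeholder container/account names:
print(wasbs_path("testupload", "myaccount", "result_csv"))
# -> wasbs://testupload@myaccount.blob.core.windows.net/result_csv
```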

I resolved the issue by changing the storage account to a BlobStorage type rather than Data Lake Storage Gen2. The Windows Azure Storage Blob (WASB) driver is not supported with Data Lake Storage Gen2.
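If keeping the Data Lake Storage Gen2 account is preferred, the usual alternative (given that Hadoop 3.2+ ships ABFS support) is the abfss:// scheme against the dfs endpoint instead of WASB against the blob endpoint. This is a sketch of the path and config-key shapes only, with placeholder names, not a tested end-to-end recipe:

```python
account = "myaccount"      # placeholder storage account name
container = "mycontainer"  # placeholder container (filesystem) name

# ABFS (Gen2) endpoints use dfs.core.windows.net and the abfss:// scheme:
abfss_path = f"abfss://{container}@{account}.dfs.core.windows.net/result_csv"
account_key_conf = f"fs.azure.account.key.{account}.dfs.core.windows.net"

# In Spark this would be applied roughly as:
#   spark.conf.set(account_key_conf, "<account key>")
#   df.coalesce(1).write.format("csv").mode("overwrite").save(abfss_path)
print(abfss_path)
```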

Related

java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html

I am trying to read data from Hudi but am getting the below error:
Caused by: java.lang.ClassNotFoundException: Failed to find data source: hudi. Please find packages at http://spark.apache.org/third-party-projects.html
I am able to read the data from Hudi in my Jupyter notebook using the below commands:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.config(
  "spark.sql.catalogImplementation", "hive"
).config(
  "spark.serializer", "org.apache.spark.serializer.KryoSerializer"
).enableHiveSupport().getOrCreate

import org.apache.hudi.DataSourceReadOptions

val hudiIncQueryDF = spark.read.format("hudi").load(
  "path"
)

import org.apache.spark.sql.functions._
hudiIncQueryDF.filter(col("column_name") === lit("2022-06-01")).show(10, false)
This jupyter notebook was opened using a cluster which was created with one of the below properties
--properties spark:spark.jars="gs://rdl-stage-lib/hudi-spark3-bundle_2.12-0.10.0.jar" \
however, when I try to run the job using spark-submit with the same cluster, I get the error above.
I have also added spark.serializer=org.apache.spark.serializer.KryoSerializer in my job properties. Not sure what's the issue.
Your application depends on the Hudi jar, and Hudi itself has its own dependencies. When you add the Maven package to your session, Spark installs the Hudi jar together with its dependencies; in your case, however, you provide only the Hudi jar file from a GCS bucket.
You can try this property instead:
--properties spark:spark.jars.packages="org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0" \
Or directly from your notebook:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.config(
  "spark.sql.catalogImplementation", "hive"
).config(
  "spark.serializer", "org.apache.spark.serializer.KryoSerializer"
).config(
  "spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog"
).config(
  "spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
).config(
  "spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0"
).enableHiveSupport().getOrCreate

org.apache.spark.SparkException: Writing job aborted on Databricks

I have used Databricks to ingest data from Event Hub and process it in real time with Pyspark Streaming. The code is working fine, but after this line:
df.writeStream.trigger(processingTime='100 seconds').queryName("myquery")\
.format("console").outputMode('complete').start()
I'm getting the following error:
org.apache.spark.SparkException: Writing job aborted.
Caused by: java.io.InvalidClassException: org.apache.spark.eventhubs.rdd.EventHubsRDD; local class incompatible: stream classdesc
I have read that this could be due to low processing power, but I am using a Standard_F4 machine, standard cluster mode with autoscaling enabled.
Any ideas?
This looks like a JAR issue. Go to the jars folder of your Spark installation and check whether you have multiple JARs for azure-eventhubs-spark_XXX.XX. If you've downloaded different versions and placed them there, remove every JAR with that name from your collection. This error may also occur if your JAR version is incompatible with other JARs. Try adding the Spark jars through the Spark config instead:
spark = SparkSession \
    .builder \
    .appName('my-spark') \
    .config('spark.jars.packages', 'com.microsoft.azure:azure-eventhubs-spark_2.11:2.3.12') \
    .getOrCreate()
This way spark will download JAR files through maven.

Why does spark-submit ignore the package that I include as part of the configuration of my spark session?

I am trying to include the org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 package as part of my spark code (via the SparkSession Builder). I understand that I can download the JAR myself and include it but I would like to figure out why the following is not working as expected:
from pyspark.sql import SparkSession
import pyspark
import json
if __name__ == "__main__":
    spark = SparkSession.builder \
        .master("local") \
        .appName("App Name") \
        .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3") \
        .getOrCreate()

    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "first_topic") \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    query = df \
        .writeStream \
        .format("console") \
        .outputMode("update") \
        .start()
When I run the job:
spark-submit main.py
I receive the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o48.load.
: org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:161)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
If I instead include the packages via the --packages flag, the dependencies are downloaded and the code runs as expected:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 main.py
The code also works if I open the PySpark shell and paste the code above. Is there a reason that the spark-submit ignores the configuration?
I think that configurations like "spark.jars.packages" should be set either in spark-defaults or passed as command-line arguments; setting them at runtime shouldn't work.
Against better judgement
I remember some people claiming that something like this worked for them, but I would say that in those cases the dependency was already there (installed in the local repo) and merely got loaded:
conf = pyspark.SparkConf()
conf.set("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3")

spark = SparkSession.builder \
    .master("local") \
    .appName("App Name") \
    .config(conf=conf) \
    .getOrCreate()
When you run spark-submit, it already creates a SparkSession that is reused by your code - thus you have to provide everything through spark-submit.
However, you do not need to actually use spark-submit to run your Spark code. Assuming your main method looks like this:
def main():
    spark = SparkSession.builder.config(...).getOrCreate()
    # your spark code below
    ...
You can run this code just via python:
> python ./my_spark_script.py
This will run your program correctly.
I faced the same problem; after googling I found this link:
https://issues.apache.org/jira/browse/SPARK-21752
According to @srowen (Sean R. Owen): "At that point, your app has already launched. You can't change the driver classpath."
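Given that, one workaround sometimes mentioned for the plain-python launch path (not spark-submit) is to set PYSPARK_SUBMIT_ARGS in the environment before the JVM is started. A sketch only; the package coordinates are just the ones from this question, and behaviour may vary across Spark versions:

```python
import os

# Must happen before the JVM is launched, i.e. before
# SparkSession.builder.getOrCreate() is first called anywhere.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.3 pyspark-shell"
)

# Only then:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.master("local").getOrCreate()
```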

DataProc Avro Version Causing Error on Image v1.0.0

We are running a few dataproc jobs with dataproc image 1.0 and spark-redshift.
We have two clusters, here are some details:
Cluster A -> Runs PySpark Streaming job, last created 2016. Jul 15. 11:27:12 AEST
Cluster B -> Runs PySpark batch jobs; the cluster is created every time a job is run and torn down afterwards.
A & B run the same code base, use the same init script, the same node types, etc.
Since sometime last Friday (2016-08-05 AEST), our code stopped working on cluster B with the following error, while cluster A is running without issues.
The following code can reproduce the issue on Cluster B (or any new cluster with image v1.0.0) while it runs fine on cluster A.
Sample PySpark Code:
from pyspark import SparkContext, SQLContext
sc = SparkContext()
sql_context = SQLContext(sc)
rdd = sc.parallelize([{'user_id': 'test'}])
df = rdd.toDF()
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "FOO")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "BAR")
df \
    .write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://foo.ap-southeast-2.redshift.amazonaws.com/bar") \
    .option("dbtable", 'foo') \
    .option("tempdir", "s3n://bar") \
    .option("extracopyoptions", "TRUNCATECOLUMNS") \
    .mode("append") \
    .save()
The above code fails in both of the following situations on Cluster B, while running fine on A. Note that the RedshiftJDBC41-1.1.10.1010.jar is installed via the cluster init script.
Running in interactive mode on master node:
PYSPARK_DRIVER_PYTHON=ipython pyspark \
--verbose \
--master "local[*]"\
--jars /usr/lib/hadoop/lib/RedshiftJDBC41-1.1.10.1010.jar \
--packages com.databricks:spark-redshift_2.10:1.0.0
Submit the job via gcloud dataproc
gcloud --project foo \
dataproc jobs submit pyspark \
--cluster bar \
--properties ^#^spark.jars.packages=com.databricks:spark-redshift_2.10:1.0.0#spark.jars=/usr/lib/hadoop/lib/RedshiftJDBC41-1.1.10.1010.jar \
foo.bar.py
The error it produces (Trace):
2016-08-08 06:12:23 WARN TaskSetManager:70 - Lost task 6.0 in stage 45.0 (TID 121275, foo.bar.internal):
java.lang.NoSuchMethodError: org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
at org.apache.avro.mapreduce.AvroKeyRecordWriter.<init>(AvroKeyRecordWriter.java:55)
at org.apache.avro.mapreduce.AvroKeyOutputFormat$RecordWriterFactory.create(AvroKeyOutputFormat.java:79)
at org.apache.avro.mapreduce.AvroKeyOutputFormat.getRecordWriter(AvroKeyOutputFormat.java:105)
at com.databricks.spark.avro.AvroOutputWriter.<init>(AvroOutputWriter.scala:82)
at com.databricks.spark.avro.AvroOutputWriterFactory.newInstance(AvroOutputWriterFactory.scala:31)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:129)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:255)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-08-08 06:12:24 ERROR YarnScheduler:74 - Lost executor 63 on kinesis-ma-sw-o7he.c.bupa-ma.internal: Container marked as failed: container_1470632577663_0003_01_000065 on host: kinesis-ma-sw-o7he.c.bupa-ma.internal. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_1470632577663_0003_01_000065
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
spark-redshift 1.0.0 requires com.databricks:spark-avro:2.0.1, which requires org.apache.avro:1.7.6.
Upon checking the version of org.apache.avro.generic.GenericData on Cluster A:
root@foo-bar-m:/home/foo# spark-shell \
> --verbose \
> --master "local[*]" \
> --deploy-mode client \
> --packages com.databricks:spark-redshift_2.10:1.0.0 \
> --jars "/usr/lib/hadoop/lib/RedshiftJDBC41-1.1.10.1010.jar"
It produces (Trace):
scala> import org.apache.avro.generic._
import org.apache.avro.generic._
scala> val c = GenericData.get()
c: org.apache.avro.generic.GenericData = org.apache.avro.generic.GenericData@496a514f
scala> c.getClass.getProtectionDomain().getCodeSource()
res0: java.security.CodeSource = (file:/usr/lib/hadoop/lib/bigquery-connector-0.7.5-hadoop2.jar <no signer certificates>)
While running the same command on Cluster B:
scala> import org.apache.avro.generic._
import org.apache.avro.generic._
scala> val c = GenericData.get()
c: org.apache.avro.generic.GenericData = org.apache.avro.generic.GenericData@72bec302
scala> c.getClass.getProtectionDomain().getCodeSource()
res0: java.security.CodeSource = (file:/usr/lib/hadoop/lib/bigquery-connector-0.7.7-hadoop2.jar <no signer certificates>)
Screenshot of Env on Cluster B. (Apologies for all the redactions).
We've tried the methods described here and here, without any success.
This is really frustrating, as Dataproc updates the image content without bumping the release version, the complete opposite of immutable releases. Now our code is broken and there is no way we can roll back to the previous version.
Sorry for the trouble! It's certainly not intended for breaking changes to occur within an image version. Note that subminor versions are rolled out "under the hood" for non-breaking bug fixes and Dataproc-specific patches.
You can revert to using the 1.0.* version from before last week by simply specifying --image-version 1.0.8 when deploying clusters from the command-line:
gcloud dataproc clusters create --image-version 1.0.8
Edit: For additional clarification, we've investigated the Avro versions in question and verified that the Avro version numbers did not actually change in any recent subminor Dataproc release. The core issue is that Hadoop itself has a latent bug: Hadoop brings avro-1.7.4 under /usr/lib/hadoop/lib/ while Spark uses avro-1.7.7. Coincidentally, Google's BigQuery connector also uses avro-1.7.7, but this turns out to be orthogonal to the known Spark/Hadoop problem with 1.7.4 vs 1.7.7. The recent image update was deemed non-breaking because the versions in fact did not change; however, classloading order changed in a nondeterministic way, and Hadoop's bad Avro version, which used to be hidden from the Spark job by pure luck, is no longer accidentally hidden in the latest image.
Dataproc's preview image currently includes a fix to the avro version in the Hadoop layer which should make it into any future Dataproc 1.1 version when it comes out; you might want to consider trying the preview version to see if Spark 2.0 is a seamless transition.

Spark submit throws error while using Hive tables

I have a strange error: I am trying to write data to Hive, and it works well in spark-shell, but when I use spark-submit it throws a "database/table not found in default" error.
Following is the code I am trying to run via spark-submit; I am using a custom build of Spark 2.0.0.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.table("spark_schema.iris_ori")
Following is the command I am using:
/home/ec2-user/Spark_Source_Code/spark/bin/spark-submit --class TreeClassifiersModels --master local[*] /home/ec2-user/Spark_Snapshots/Spark_2.6/TreeClassifiersModels/target/scala-2.11/treeclassifiersmodels_2.11-1.0.3.jar /user/ec2-user/Input_Files/defPath/iris_spark SPECIES~LBL+PETAL_LENGTH+PETAL_WIDTH RAN_FOREST 0.7 123 12
Following is the error:
16/05/20 09:05:18 INFO SparkSqlParser: Parsing command: spark_schema.measures_20160520090502
Exception in thread "main" org.apache.spark.sql.AnalysisException: Database 'spark_schema' does not exist;
at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:37)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.tableExists(InMemoryCatalog.scala:195)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:360)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:464)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:458)
at TreeClassifiersModels$.main(TreeClassifiersModels.scala:71)
at TreeClassifiersModels.main(TreeClassifiersModels.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:726)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:183)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:208)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:122)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The issue was caused by a deprecation in Spark 2.0.0: HiveContext was deprecated there. To read/write Hive tables on Spark 2.0.0 we need to use a Spark session as follows.
val sparkSession = SparkSession.withHiveSupport(sc)
