Read/Write Delta Lake tables on S3 using AWS Glue jobs - apache-spark

I am trying to access Delta Lake tables stored on S3 using AWS Glue jobs, but I am getting the error "Module Delta not defined".
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
spark = SparkSession.builder.appName("MyApp").config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.0").getOrCreate()
from delta.tables import *
data = spark.range(0, 5)
data.write.format("delta").save("S3://databricksblaze/data")
I have also added the necessary JAR (delta-core_2.11-0.6.0.jar) to the dependent JARs of the Glue job.
Can anyone help me with this?
Thanks

I have had success using Glue + Delta Lake. I added the Delta Lake dependencies to the "Dependent jars path" section of the Glue job.
Here is the list of them (I am using Delta Lake 0.6.1):
com.ibm.icu_icu4j-58.2.jar
io.delta_delta-core_2.11-0.6.1.jar
org.abego.treelayout_org.abego.treelayout.core-1.0.3.jar
org.antlr_antlr4-4.7.jar
org.antlr_antlr4-runtime-4.7.jar
org.antlr_antlr-runtime-3.5.2.jar
org.antlr_ST4-4.0.8.jar
org.glassfish_javax.json-1.0.4.jar
Then in your Glue job you can use the following code:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
# Make the delta Python module (shipped inside the delta-core JAR) importable
sc.addPyFile("io.delta_delta-core_2.11-0.6.1.jar")

from delta.tables import *

glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Write a small dataset as a Delta table on S3 and get a DeltaTable handle back
delta_path = "s3a://your_bucket/folder"
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save(delta_path)
deltaTable = DeltaTable.forPath(spark, delta_path)
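To sanity-check the write, the table can be read back with the same delta format (a small sketch reusing delta_path and deltaTable from above):
# Read the Delta table back and inspect it
df = spark.read.format("delta").load(delta_path)
df.show()
# The DeltaTable handle exposes the same data via toDF()
deltaTable.toDF().show()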

You need to pass the additional configuration properties:
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

Setting spark.jars.packages in SparkSession.builder.config doesn't work. spark.jars.packages is handled by org.apache.spark.deploy.SparkSubmitArguments/SparkSubmit, so it must be passed as an argument of the spark-submit or pyspark script. By the time SparkSession.builder.config is called, SparkSubmit has already done its job, so spark.jars.packages is a no-op at that point. See https://issues.apache.org/jira/browse/SPARK-21752 for more details.
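In other words, supply the package on the command line rather than in the builder. A rough sketch (the version number and script name are illustrative):
# spark-submit --packages io.delta:delta-core_2.11:0.6.1 my_job.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
data = spark.range(0, 5)
data.write.format("delta").save("s3a://your_bucket/folder")  # the delta source now resolves from the submitted package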

Related

java.lang.NoSuchMethodError: org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders while reading from Azure Blob Storage

I am trying to read a CSV file stored in an Azure Storage account. For that, I installed Spark on my virtual machine and am trying to read the CSV file into a DataFrame from PySpark.
I read how to do this, followed the steps, and copied the latest hadoop-azure & azure-storage JAR files into my jars directory. Then I ran into this error:
NoClassDefFoundError: org/apache/hadoop/fs/StreamCapabilities
I searched for this error and found that I needed to use hadoop-azure-2.8.5.jar instead of the latest hadoop-azure JAR. So I swapped in hadoop-azure-2.8.5.jar and executed my PySpark code again.
After executing my code, I encountered another error:
: java.lang.NoSuchMethodError:
org.apache.hadoop.security.ProviderUtils.excludeIncompatibleCredentialProviders(Lorg/apache/hadoop/conf/Configuration;Ljava/lang/Class;)Lorg/apache/hadoop/conf/Configuration;
Below is my PySpark code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.types import *
from pyspark.sql.functions import *

spark = SparkSession.builder.getOrCreate()

storage_account_name = "<storage_account_name>"
storage_account_access_key = "<storage_account_access_key>"

# Register the storage account key and the WASB filesystem implementation
spark.conf.set("fs.azure.account.key." + storage_account_name + ".blob.core.windows.net", storage_account_access_key)
spark._jsc.hadoopConfiguration().set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark._jsc.hadoopConfiguration().set("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark._jsc.hadoopConfiguration().set("fs.azure.account.key." + storage_account_name + ".blob.core.windows.net", storage_account_access_key)

df = spark.read.format("csv").option("inferSchema", "true").load("wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/<path_to_csv>/sample_file.csv")
df.show()
I searched around and tried various hadoop-azure JAR versions. The one that worked for me was hadoop-azure-2.7.0.jar.
With this JAR version, I was able to read the CSV file from Blob storage.
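For reference, a minimal sketch of the working setup, assuming a Spark build based on Hadoop 2.7.x (the hadoop-azure and azure-storage versions must match that Hadoop version; the coordinates below are illustrative):
# pyspark --packages org.apache.hadoop:hadoop-azure:2.7.0,com.microsoft.azure:azure-storage:2.0.0
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Register the storage account key for the wasbs:// filesystem
spark.conf.set("fs.azure.account.key.<storage_account_name>.blob.core.windows.net", "<storage_account_access_key>")
df = spark.read.csv("wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/<path_to_csv>/sample_file.csv", header=True, inferSchema=True)
df.show()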

How to Stop or Delete HiveContext in Pyspark?

I'm facing the following problem:
def my_func(table, usr, psswrd):
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import HiveContext

    sconf = SparkConf()
    sconf.setAppName('TEST')
    sconf.set("spark.master", "local[2]")

    sc = SparkContext(conf=sconf)
    hctx = HiveContext(sc)

    ## Initialize variables (url, driver, ...)

    df = hctx.read.format("jdbc").options(url=url,
                                          user=usr,
                                          password=psswrd,
                                          driver=driver,
                                          dbtable=table).load()
    pd_df = df.toPandas()
    sc.stop()
    return pd_df
The problem here is the persistence of the HiveContext (i.e. if I do hctx._get_hive_ctx() it returns JavaObject id=Id).
So if I use my_func several times in the same script, it fails the second time.
I would like to remove the HiveContext, which is apparently not deleted when I stop the SparkContext.
Thanks
Removing the HiveContext is not possible, as some state persists after sc.stop() that makes it not work in some cases.
But there is a workaround (caution: it's dangerous) if it is feasible for you: delete the metastore_db directory every time you start/stop your SparkContext. Again, check whether that is feasible for you. The Java code is below (in your case you would have to rewrite it in Python).
File hiveLocalMetaStorePath = new File("metastore_db");
FileUtils.deleteDirectory(hiveLocalMetaStorePath);
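A rough Python equivalent (a sketch; it assumes the embedded Derby metastore lives in the default metastore_db folder of the current working directory):
import os
import shutil
# Delete the local Derby metastore between runs so the next HiveContext starts clean
hive_local_metastore_path = "metastore_db"
if os.path.isdir(hive_local_metastore_path):
    shutil.rmtree(hive_local_metastore_path)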
You can better understand it from the following links.
https://issues.apache.org/jira/browse/SPARK-10872
https://issues.apache.org/jira/browse/SPARK-11924

Pyspark reads csv - NameError: name 'spark' is not defined

I am trying to run the following code in databricks in order to call a spark session and use it to open a csv file:
spark
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)
And I get the following error:
NameError: name 'spark' is not defined
Any idea what might be wrong?
I have also tried to run:
from pyspark.sql import SparkSession
But got the following in response:
ImportError: cannot import name SparkSession
If it helps, I am trying to follow this example (you will understand better if you watch it from 17:30 on):
https://www.youtube.com/watch?v=K14plpZgy_c&list=PLIxzgeMkSrQ-2Uizm4l0HjNSSy2NxgqjX
I got it working by using the following imports:
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession, SQLContext
I got the idea by looking into the PySpark code, as I found that reading CSV worked in the interactive shell.
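For reference, with those imports in place the session can also be built explicitly before reading the CSV (a sketch reusing the path from the question):
from pyspark import SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
# Reuse the existing context if one is already running (as on Databricks)
sc = SparkContext.getOrCreate(SparkConf())
spark = SparkSession(sc)
fireServiceCallsDF = spark.read.csv('/mnt/sf_open_data/fire_dept_calls_for_service/Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True)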
Please note that the example code you are using is for Spark version 2.x.
"spark" and "SparkSession" are not available in Spark 1.x. The error messages you are getting point to a possible version issue (Spark 1.x).
Check the Spark version you are using.
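A quick way to confirm which situation you are in (a sketch; the Spark 1.x branch is shown in comments because the APIs differ):
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
print(sc.version)  # e.g. 1.6.x vs 2.x
# Spark 2.x+: SparkSession is the entry point
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# Spark 1.x: use SQLContext instead
# from pyspark.sql import SQLContext
# sqlContext = SQLContext(sc)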

How to add Hive properties at runtime in spark-shell

How do you set a Hive property like hive.metastore.warehouse.dir at runtime? Or at least, is there a more dynamic way of setting such a property than putting it in a file like spark_home/conf/hive-site.xml?
I faced the same issue, and for me it worked to set the Hive property from Spark (2.4.0). Please find below all the options, via spark-shell, spark-submit and SparkConf.
Option 1 (spark-shell)
spark-shell --conf spark.hadoop.hive.metastore.warehouse.dir=some_path\metastore_db_2
Initially I tried spark-shell with hive.metastore.warehouse.dir set to some_path\metastore_db_2 and got the following warning:
Warning: Ignoring non-spark config property:
hive.metastore.warehouse.dir=C:\winutils\hadoop-2.7.1\bin\metastore_db_2
However, when I create a Hive table with:
bigDf.write.mode("overwrite").saveAsTable("big_table")
the Hive metadata is stored correctly under the metastore_db_2 folder.
When I use spark.hadoop.hive.metastore.warehouse.dir the warning disappears and the results are still saved in the metastore_db_2 directory.
Option 2 (spark-submit)
In order to use hive.metastore.warehouse.dir when submitting a job with spark-submit, I followed these steps.
First I wrote some code to save some random data with Hive:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("metastore_test").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
import spark.implicits._
var dfA = spark.createDataset(Seq(
  (1, "val1", "p1"),
  (2, "val1", "p2"),
  (3, "val2", "p3"),
  (3, "val3", "p4"))).toDF("id", "value", "p")
dfA.write.mode("overwrite").saveAsTable("metastore_test")
spark.sql("select * from metastore_test").show(false)
Next I submitted the job with:
spark-submit --class org.tests.Main \
  --conf spark.hadoop.hive.metastore.warehouse.dir=C:\winutils\hadoop-2.7.1\bin\metastore_db_2 \
  spark-scala-test_2.11-0.1.jar
The metastore_test table was properly created under the C:\winutils\hadoop-2.7.1\bin\metastore_db_2 folder.
Option 3 (SparkConf)
Via SparkSession in the Spark code.
val sparkConf = new SparkConf()
  .setAppName("metastore_test")
  .set("spark.hadoop.hive.metastore.warehouse.dir", "C:\\winutils\\hadoop-2.7.1\\bin\\metastore_db_2")
  .setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
This attempt was successful as well.
The question which still remains is why I have to prefix the property with spark.hadoop for it to work as expected.

Reading Avro into spark using spark-avro

I'm not able to read Avro files into Spark using the spark-avro library. Here are the steps I took:
Got the jar from: http://mvnrepository.com/artifact/com.databricks/spark-avro_2.10/0.1
Invoked spark-shell using spark-shell --jars avro/spark-avro_2.10-0.1.jar
Executed commands as given in the git readme:
import com.databricks.spark.avro._
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val episodes = sqlContext.avroFile("episodes.avro")
The action sqlContext.avroFile("episodes.avro") fails with the following error:
scala> val episodes = sqlContext.avroFile("episodes.avro")
java.lang.IncompatibleClassChangeError: class com.databricks.spark.avro.AvroRelation has interface org.apache.spark.sql.sources.TableScan as super class
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
My bad. The readme clearly says:
Versions
Spark changed how it reads / writes data in 1.4, so please use the version of this library that matches your Spark version:
1.3 -> 1.0.0
1.4+ -> 1.1.0-SNAPSHOT
I used Spark 1.3.1 with spark-avro 1.1.0. When I switched to spark-avro 1.0.0, it worked.
Since the spark-avro module is external, there is no .avro API in DataFrameReader or DataFrameWriter.
To load/save data in Avro format, you need to specify the data source format as avro (on Spark 2.4+ the external spark-avro package still has to be on the classpath, e.g. via --packages org.apache.spark:spark-avro_2.11:2.4.0).
Example:
val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
Alternatively, using the com.databricks.spark.avro format name:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName(appName).master(master).getOrCreate()
val sqlContext = spark.sqlContext

// header/inferSchema are CSV options and are not needed for Avro
val episodes = sqlContext.read.format("com.databricks.spark.avro")
  .load("episodes.avro")
episodes.show(10)
