PySpark accessing glue data catalog - apache-spark

I am having trouble accessing a table in the Glue Data Catalog using PySpark in Hue/Zeppelin on EMR. I have tried both emr-5.13.0 and emr-5.12.1.
I tried following https://github.com/aws-samples/aws-glue-samples/blob/master/examples/data_cleaning_and_lambda.md
but when trying to import the GlueContext it errors with "No module named awsglue.context".
Another note: when running spark.sql("SHOW TABLES").show() it comes up empty in Hue/Zeppelin, but when using the pyspark shell on the master node I am able to see and query the table from the Glue Data Catalog.
Any help is much appreciated, thanks!

OK, I spent some time simulating the issue, so I spun up an EMR cluster with "Use AWS Glue Data Catalog for table metadata" enabled. After enabling web connections, I issued a show databases command in Zeppelin and it worked fine. Please find the command and output from Zeppelin below:
%spark
spark.sql("show databases").show
+-------------------+
|       databaseName|
+-------------------+
|airlines-historical|
|            default|
|      glue-poc-tpch|
|     legislator-new|
|        legislators|
|      nursinghomedb|
| nycitytaxianalysis|
| ohare-airport-2006|
|           payments|
|              s100g|
|                s1g|
|           sampledb|
|             testdb|
|               tpch|
|           tpch_orc|
|       tpch_parquet|
+-------------------+
As for your other issue, "No module named awsglue.context": I think it may not be possible with an EMR-commissioned Zeppelin. I believe the only way awsglue.context can be accessed or used is via a Glue development endpoint that you set up in AWS Glue, and then a Glue Jupyter notebook or a locally set up Zeppelin notebook connected to that development endpoint.
I am not sure whether the Glue context can be accessed directly from an EMR-commissioned Zeppelin notebook; maybe I am wrong.
You can still access the Glue catalog, since EMR provides an option for that, so you can reach the databases and do your ETL jobs.
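For example, with the Glue Data Catalog option enabled on the cluster, something like the following should work from a Zeppelin %pyspark paragraph (the database and table names below, sampledb and some_table, are placeholders for illustration):
%pyspark
# Databases and tables registered in the Glue Data Catalog show up through Spark SQL
spark.sql("SHOW DATABASES").show()
spark.sql("SHOW TABLES IN sampledb").show()
spark.sql("SELECT * FROM sampledb.some_table LIMIT 10").show()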
Thanks.

Please check the details in this link from AWS and see whether the EMR cluster is configured as recommended (Configure Glue Catalog in EMR). Also ensure that appropriate permissions are granted to access the AWS Glue catalog; details are in the attached link. Hope this helps.

You can use the function below to check whether a database exists in Glue:
import boto3

def isDatabasePresent(database_name):
    """
    check if the glue database exists
    :return: Boolean
    """
    client = boto3.client('glue')  # stands in for the original get_glue_client() helper
    # Note: get_databases() is paginated; follow NextToken for large catalogs
    responseGetDatabases = client.get_databases()
    databaseList = responseGetDatabases['DatabaseList']
    for databaseDict in databaseList:
        if database_name == databaseDict['Name']:
            return True
    return False
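For example (using one of the database names from the listing above; AWS credentials and region for boto3 are assumed to be configured):
if isDatabasePresent('sampledb'):
    print('sampledb exists in the Glue Data Catalog')
else:
    print('sampledb was not found')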

Related

Azure Databricks external Hive Metastore

I checked the [documentation][1] on using an external Hive metastore (Azure SQL Database) with Azure Databricks.
I was able to download the jars and place them into /dbfs/hive_metastore_jar.
My next step is to run the cluster with an init script:
# Hive-specific configuration options.
# spark.hadoop prefix is added to make sure these Hive specific options propagate to the metastore client.
# JDBC connect string for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionURL jdbc:sqlserver://<host>.database.windows.net:1433;database=<database> #should I add more parameters?
# Username to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionUserName admin
# Password to use against metastore database
spark.hadoop.javax.jdo.option.ConnectionPassword p#ssword
# Driver class name for a JDBC metastore
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
# Spark specific configuration options
spark.sql.hive.metastore.version 2.7.3 #I am not sure about this
# Skip this one if <hive-version> is 0.13.x.
spark.sql.hive.metastore.jars /dbfs/hive_metastore_jar
I've uploaded the init script to DBFS and launched the cluster, but it failed to read the script. Something is wrong.
[1]: https://learn.microsoft.com/en-us/azure/databricks/data/metastores/external-hive-metastore
I solved this for now. The problems I faced:
1. I didn't copy the Hive jars to the cluster's local file system. This is important: I couldn't point spark.sql.hive.metastore.jars at DBFS and had to point it at a local copy of the Hive jars. The init script copies them.
2. The connection was fine. I also used the Azure template with a VNet, which is preferable, and then allowed traffic to Azure SQL from the VNet that hosts Databricks.
3. The last issue: I had to create the Hive schema before starting Databricks, by copying and running the DDL for Hive 1.2 from Git. I deployed it into the Azure SQL Database and then I was good to go.
There is a useful notebook with steps to download the jars. It downloads the jars to tmp, then we copy them to our own folder. Finally, during cluster creation we reference the init script that holds all the parameters; it includes the step that copies the jars from DBFS to the local file system of the cluster.
// This example is for an init script named `external-metastore_hive121.sh`.
dbutils.fs.put(
"dbfs:/databricks/scripts/external-metastore_hive121.sh",
"""#!/bin/sh
|# A temporary workaround to make sure /dbfs is available.
|sleep 10
|# Copy metastore jars from DBFS to the local FileSystem of every node.
|mkdir -p /databricks/hive_1_2_1_metastore_jars
|cp -r /dbfs/metastore_jars/hive-v1_2/* /databricks/hive_1_2_1_metastore_jars
|# Loads environment variables to determine the correct JDBC driver to use.
|source /etc/environment
|# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
|[driver] {
| # Hive specific configuration options.
| # spark.hadoop prefix is added to make sure these Hive specific options will propagate to the metastore client.
| # JDBC connect string for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:sqlserver://host--name.database.windows.net:1433;database=tcdatabricksmetastore_dev;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net"
|
| # Username to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionUserName" = "admin"
|
| # Password to use against metastore database
| "spark.hadoop.javax.jdo.option.ConnectionPassword" = "P#ssword"
|
| # Driver class name for a JDBC metastore
| "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
|
| # Spark specific configuration options
| "spark.sql.hive.metastore.version" = "1.2.1"
| # Skip this one if ${hive-version} is 0.13.x.
| "spark.sql.hive.metastore.jars" = "/databricks/hive_1_2_1_metastore_jars/*"
|}
|EOF
|""".stripMargin,
overwrite = true)
The command creates a file in DBFS, which we then reference during cluster creation.
According to the documentation, we should use the following config so that the metastore schema is created automatically:
datanucleus.autoCreateSchema true
datanucleus.fixedDatastore false
That didn't work for me, which is why I took the Hive DDL from Git and created the schema and tables myself.
You can test that everything works with the command:
%sql show databases

How to read stderr logs from AWS logs

I am using EMR steps to run my jobs.
Typically when I want to analyze the performance of a job or to understand why it failed, I look at the spark history server for DAG visualizations, and job errors, etc.
For example, if the job failed due to a heap error, or FetchFailed, etc., I can see it clearly specified in the Spark history server.
However, I can't seem to be able to find such descriptions when I look at the stderr log files that are written to the LOG URI S3 bucket.
Is there a way to obtain such information?
I use pyspark and set the log level to
sc = spark.sparkContext
sc.setLogLevel('DEBUG')
Any insight as to what I am doing wrong?
I haven't really tested this, but as it's a bit too long to fit in a comment, I'm posting it here as an answer.
As pointed out in my comment, the logs you're viewing in the Spark History Server UI aren't the same as the Spark driver logs that are saved to S3 from EMR.
To get the Spark history server logs written to S3, you'll have to add some additional configuration to your cluster. These configuration options are described in the Monitoring and Instrumentation section of the Spark documentation.
In AWS EMR, you could try adding something like this to your cluster configuration:
...
{
    'Classification': 'spark-defaults',
    'Properties': {
        'spark.eventLog.dir': 's3a://your_bucket/spark_logs',
        'spark.history.fs.logDirectory': 's3a://your_bucket/spark_logs',
        'spark.eventLog.enabled': 'true'
    }
},
...
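If the cluster is created programmatically, here is a minimal sketch of where that block goes, assuming boto3 and using placeholder names for the cluster, bucket, region, instance type, and IAM roles:
import boto3

emr = boto3.client('emr', region_name='us-east-1')  # region is a placeholder

response = emr.run_job_flow(
    Name='spark-logging-example',          # hypothetical cluster name
    ReleaseLabel='emr-5.30.0',             # assumed release label
    Applications=[{'Name': 'Spark'}],
    Instances={
        'InstanceGroups': [
            {'Name': 'Master', 'InstanceRole': 'MASTER',
             'InstanceType': 'm5.xlarge', 'InstanceCount': 1},
        ],
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    Configurations=[
        {
            'Classification': 'spark-defaults',
            'Properties': {
                'spark.eventLog.dir': 's3a://your_bucket/spark_logs',
                'spark.history.fs.logDirectory': 's3a://your_bucket/spark_logs',
                'spark.eventLog.enabled': 'true',
            },
        },
    ],
    JobFlowRole='EMR_EC2_DefaultRole',     # default EMR roles, assumed to exist
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])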
I found this interesting post which describes how to set this up for a Kubernetes cluster; you may want to check it for further details.

What does "avoid multiple Kudu clients per cluster" mean?

I am looking at Kudu's documentation.
Below is a partial description of kudu-spark.
https://kudu.apache.org/docs/developing.html#_avoid_multiple_kudu_clients_per_cluster
Avoid multiple Kudu clients per cluster.
One common Kudu-Spark coding error is instantiating extra KuduClient objects. In kudu-spark, a KuduClient is owned by the KuduContext. Spark application code should not create another KuduClient connecting to the same cluster. Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.
To diagnose multiple KuduClient instances in a Spark job, look for signs in the logs of the master being overloaded by many GetTableLocations or GetTabletLocations requests coming from different clients, usually around the same time. This symptom is especially likely in Spark Streaming code, where creating a KuduClient per task will result in periodic waves of master requests from new clients.
Does this mean that I can only run one kudu-spark task at a time?
If I have a Spark Streaming program that is always writing data to Kudu, how can I connect to Kudu from other Spark programs?
In a non-Spark program you use a KuduClient to access Kudu. In a Spark application you use a KuduContext, which already owns such a client for that Kudu cluster.
A simple Java program requires a KuduClient via the Java API (and a Maven build):
KuduClient kuduClient = new KuduClient.KuduClientBuilder("kudu-master-hostname").build();
See http://harshj.com/writing-a-simple-kudu-java-api-program/
A Spark/Scala program, of which many can run at the same time against the same cluster, uses the Spark-Kudu integration. The snippet below is borrowed from the official guide, as I looked at this quite some time ago.
import org.apache.kudu.client._
import org.apache.kudu.spark.kudu.KuduContext  // needed for KuduContext below
import collection.JavaConverters._

// Read a table from Kudu
val df = spark.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .format("kudu").load

// Query using the Spark API...
df.select("id").filter("id >= 5").show()

// ...or register a temporary table and use SQL
df.registerTempTable("kudu_table")
val filteredDF = spark.sql("select id from kudu_table where id >= 5").show()

// Use KuduContext to create, delete, or write to Kudu tables
val kuduContext = new KuduContext("kudu.master:7051", spark.sparkContext)

// Create a new Kudu table from a dataframe schema
// NB: No rows from the dataframe are inserted into the table
kuduContext.createTable("test_table", df.schema, Seq("key"),
  new CreateTableOptions()
    .setNumReplicas(1)
    .addHashPartitions(List("key").asJava, 3))

// Insert data
kuduContext.insertRows(df, "test_table")
See https://kudu.apache.org/docs/developing.html
A clearer statement of "avoid multiple Kudu clients per cluster" is "avoid multiple Kudu clients per Spark application".
Instead, application code should use the KuduContext to access a KuduClient using KuduContext#syncClient.

Option to enable glue catalog for Presto/Spark on EMR using Terraform

I wanted to know if there's support for enabling the AWS Glue catalog for Presto/Spark when running on EMR. I could not find anything in the documentation.
From the link provided in the other answer, I was able to model the Terraform code as follows:
Create a configuration.json.tpl with the following content:
[
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
Create a template data source from the above file in your Terraform code:
data "template_file" "cluster_1_configuration" {
  template = "${file("${path.module}/templates/configuration.json.tpl")}"
}
And then set up the cluster as such:
resource "aws_emr_cluster" "cluster_1" {
  name           = "${var.cluster_name}-1"
  release_label  = "emr-5.21.0"
  applications   = ["Spark", "Zeppelin", "Hadoop", "Sqoop"]
  log_uri        = "s3n://${var.cluster_name}/logs/"
  configurations = "${data.template_file.cluster_1_configuration.rendered}"
  ...
}
Glue should now work from Spark; you can verify this by calling spark.catalog.listDatabases().show() from spark-shell.
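For PySpark, a rough equivalent check (note that in Python spark.catalog.listDatabases() returns a plain list rather than a DataFrame, so there is no .show() on it):
# Run from a pyspark shell or a %pyspark notebook paragraph on the cluster
for db in spark.catalog.listDatabases():
    print(db.name)

# Or go through SQL, which returns a DataFrame you can .show()
spark.sql("SHOW DATABASES").show()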
The following AWS documents discuss using Apache Spark and Hive on Amazon EMR with the AWS Glue Data Catalog, and also using the AWS Glue Data Catalog as the default Hive metastore for Presto (Amazon EMR release version 5.10.0 and later). I hope this is what you are looking for:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-glue.html and
https://aws.amazon.com/about-aws/whats-new/2017/08/use-apache-spark-and-hive-on-amazon-emr-with-the-aws-glue-data-catalog/
Also please check this SO link for some glue catalog configurations on EMR:
Issue with AWS Glue Data Catalog as Metastore for Spark SQL on EMR

Accessing Spark RDDs from a web browser via thrift server - java

We have processed our data using Spark 1.2.1 with Java and stored it in Hive tables. We want to access this data as RDDs from a web browser.
I read the documentation and understood the steps to do the task.
I am unable to find a way to interact with Spark SQL RDDs via the thrift server. The examples I found have the below line in the code, and I cannot find the class for it in the Spark 1.2.1 Java API docs.
HiveThriftServer2.startWithContext
On GitHub I saw Scala examples using import org.apache.spark.sql.hive.thriftserver, but I don't see this in the Java API docs. Not sure if I am missing something.
Did anybody have luck accessing Spark SQL RDDs from a browser via thrift? Can you post a code snippet? We are using Java.
I've got most of this working. Let's dissect each part of it (references at the bottom of the post).
HiveThriftServer2.startWithContext is defined in Scala. I was never able to access it from Java or from Python using Py4J, and I am no JVM expert, but I ended up switching to Scala. This may have something to do with the @DeveloperApi annotation. This is how I imported it in Scala in Spark 1.6.1:
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
For anyone reading this and not using Hive, a Spark SQL context won't do; you need a Hive context. Note also that the HiveContext constructor takes a Scala SparkContext, so if you are holding a JavaSparkContext you have to convert it:
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.hive.HiveContext
var hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc))
Now start the thrift server
HiveThriftServer2.startWithContext(hiveContext)
// Yay
Next, we need to make our RDDs available as SQL tables. First, we have to convert them into Spark SQL DataFrames:
val someDF = hiveContext.createDataFrame(someRDD)
Then, we need to turn them into Spark SQL tables. You do this by persisting them to Hive, or making the RDD available as a temporary table.
Persist to Hive:
// Deprecated since Spark 1.4, to be removed in Spark 2.0:
someDF.saveAsTable("someTable")
// Up-to-date at time of writing
someDF.write().saveAsTable("someTable")
Or, use a temporary table:
// Use the Data Frame as a Temporary Table
// Introduced in Spark 1.3.0
someDF.registerTempTable("someTable")
Note: temporary tables are isolated to a SQL session. Spark's Hive thrift server is multi-session by default in version 1.6 (one session per connection). Therefore, for clients to access the temporary tables you've registered, you'll need to set the option spark.sql.hive.thriftServer.singleSession to true.
You can test this by querying the tables in beeline, a command line utility for interacting with the hive thrift server. It ships with Spark.
Finally, you need a way of accessing the hive thrift server from the browser. Thanks to its awesome developers, it has an HTTP mode, so if you want to build a web app, you can use the thrift protocol over AJAX requests from the browser. A simpler strategy might be to create an IPython notebook, and use pyhive to connect to the thrift server.
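For reference, a minimal PyHive sketch, assuming the thrift server is reachable on localhost:10000 and that a table named someTable has been registered as described above:
from pyhive import hive

# Connect to the Spark (Hive) thrift server; host and port are placeholders
conn = hive.connect(host='localhost', port=10000)
cursor = conn.cursor()
cursor.execute('SELECT * FROM someTable LIMIT 10')
for row in cursor.fetchall():
    print(row)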
Data Frame Reference:
https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/DataFrame.html
singleSession option pull request:
https://mail-archives.apache.org/mod_mbox/spark-commits/201511.mbox/%3Cc2bd1313f7ca4e618ec89badbd8f9f31#git.apache.org%3E
HTTP mode and beeline howto:
https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
Pyhive:
https://github.com/dropbox/PyHive
HiveThriftServer2 startWithContext definition:
https://github.com/apache/spark/blob/6b1a6180e7bd45b0a0ec47de9f7c7956543f4dfa/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/HiveThriftServer2.scala#L56-73
Thrift is a JDBC/ODBC server. You can connect to it via JDBC/ODBC connections and access content through the HiveDriver. You cannot get RDDs back from it, because HiveContext is not available. What you referred to is an experimental feature not available for Java.
As a workaround, you could re-parse the results and create your structures for your client.
For example:
import java.sql.*;

public class HiveJdbcClient {  // wrapper class added so the snippet compiles as a standalone program
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";
    private static String hiveConnectionString = "jdbc:hive2://YourHiveServer:Port";
    private static String tableName = "SOME_TABLE";

    public static void main(String[] args) throws Exception {
        Class.forName(driverName);
        Connection con = DriverManager.getConnection(hiveConnectionString, "user", "pwd");
        Statement stmt = con.createStatement();
        String sql = "select * from " + tableName;
        ResultSet res = stmt.executeQuery(sql);
        parseResultsToObjects(res); // your own mapping from the ResultSet to client-side structures
    }
}
