How to query and join CSV data with HBase data in a Spark cluster in Azure

In Microsoft Azure, we can create a Spark cluster in Azure HDInsight and an HBase cluster in Azure HDInsight. I have now created both kinds of clusters. For the Spark cluster, I can create a DataFrame from a CSV file and run a SQL query like this (the query below is executed in a Jupyter notebook):
%%sql
SELECT buildingID, (targettemp - actualtemp) AS temp_diff, date FROM hvac WHERE date = "6/1/13"
Meanwhile, in the Spark shell, I can create a connector to the other HBase cluster and query a table in that HBase cluster like this:
val query = spark.sqlContext.sql("select personalName, officeAddress from contacts")
query.show()
So, my question is: is there a way to join these two tables? For example:
select * from hvac a inner join contacts b on a.id = b.id
I have referenced these two Microsoft Azure documents:
Run queries on Spark Cluster
Use Spark to read and write HBase data
Any ideas or suggestions?
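One approach worth exploring (a sketch only, not a verified HDInsight setup): if the HBase connector DataFrame and the CSV DataFrame can both be created in the same Spark session, they can each be registered as a temporary view and joined with a single SQL statement. Shown here in PySpark; the same pattern applies in the Scala shell. The CSV path and the join column id are placeholders, not values from the actual clusters.
# Sketch: both tables registered as temporary views in the same Spark session.
# The CSV path and the "id" join column are placeholders.
hvac_df = spark.read.csv("wasb:///example/data/hvac.csv", header=True, inferSchema=True)
hvac_df.createOrReplaceTempView("hvac")

# contacts_df would be the DataFrame produced by the HBase connector,
# exactly as in the spark-shell example above, just registered as a view:
# contacts_df.createOrReplaceTempView("contacts")

joined = spark.sql("""
    SELECT *
    FROM hvac a
    INNER JOIN contacts b
      ON a.id = b.id
""")
joined.show()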

Related

Pyspark on EMR and external hive/glue - can drop but not create tables via sqlContext

I'm writing a dataframe to an external Hive table from PySpark running on EMR. The work involves dropping/truncating data from an external Hive table, writing the contents of a dataframe into that table, then writing the data from Hive to DynamoDB. I am looking at writing to an internal table on the EMR cluster, but for now I would like the Hive data to be available to subsequent clusters. I could write to the Glue catalog directly and force it to be registered, but that is a step further than I need to go.
All components work fine individually on a given EMR cluster: I can create an external Hive table on EMR, either using a script or via ssh and the Hive shell. This table can be queried by Athena and read by PySpark. I can create a dataframe and INSERT OVERWRITE the data into the aforementioned table in PySpark.
I can then use hive shell to copy the data from the hive table into a DynamoDB table.
I'd like to wrap all of the work into one PySpark script instead of having to submit multiple distinct steps.
I am able to drop tables using
sqlContext.sql("drop table if exists default.my_table")
When I try to create a table using sqlContext.sql("create table default.mytable(id string,val string) STORED AS ORC") I get the following error:
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-xx-xxx-xx-xxx/xx.xxx.xx.xx to ip-xxx-xx-xx-xx:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=ip-xxx-xx-xx-xx:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
I can't figure out why I can create an external Hive table in Glue using the Hive shell on the cluster, and drop the table using the Hive shell or the PySpark sqlContext, but I can't create a table using sqlContext. I have checked around and the solutions offered don't make sense in this context (copying hive-site.xml), as I can clearly write to the required addresses with no hassle, just not in PySpark. And it is doubly strange that I can drop the tables, and they are definitely dropped when I check in Athena.
Running on:
emr-5.28.0,
Hadoop distribution Amazon 2.8.5
Spark 2.4.4
Hive 2.3.6
Livy 0.6.0 (for notebooks but my experimentation is via ssh and pyspark shell)
Turns out I could create tables via a spark.sql() call as long as I provided a location for the tables. It seems the Hive shell doesn't require it, yet spark.sql() does. Unexpected, but not entirely surprising.
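For reference, a minimal sketch of that fix, assuming an S3 location you control (the bucket and path below are placeholders, not the ones from my job):
# Sketch: providing an explicit LOCATION lets the create succeed from spark.sql(),
# whereas the location-less CREATE TABLE above timed out. The s3:// path is a placeholder.
spark.sql("""
    CREATE TABLE IF NOT EXISTS default.my_table (id string, val string)
    STORED AS ORC
    LOCATION 's3://my-bucket/warehouse/my_table/'
""")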
Complementing @Zeathor's answer: after configuring the EMR and Glue connection and permissions (you can check more here: https://www.youtube.com/watch?v=w20tapeW1ME), you just need to write Spark SQL commands:
spark = SparkSession.builder.appName('TestSession').getOrCreate()
spark.sql("create database if not exists test")
You can then create your tables from dataframes:
df.createOrReplaceTempView("first_table")
spark.sql("create table test.table_name as select * from first_table")
All the database and table metadata will then be stored in the AWS Glue Data Catalog.

SQL query taking too long in Azure Databricks

I want to execute a SQL query on a database in an Azure SQL Managed Instance using Azure Databricks. I have connected to the database using the Spark connector.
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
val config = Config(Map(
  "url"          -> "mysqlserver.database.windows.net",
  "databaseName" -> "MyDatabase",
  "queryCustom"  -> "SELECT TOP 100 * FROM dbo.Clients WHERE PostalCode = 98074", // SQL query
  "user"         -> "username",
  "password"     -> "*********"
))
//Read all data in table dbo.Clients
val collection = sqlContext.read.sqlDB(config)
collection.show()
I am using the above method to fetch the data (example from the MSFT docs). The tables are over 10 million rows in my case. My question is: how does Databricks process the query here?
Below is the documentation:
The Spark master node connects to databases in SQL Database or SQL Server and loads data from a specific table or using a specific SQL query.
The Spark master node distributes data to worker nodes for transformation.
The Spark worker nodes connect to the SQL Database or SQL Server and write data to the database. Users can choose to use row-by-row insertion or bulk insert.
It says the master node fetches the data and later distributes the work to worker nodes. In the above code, while fetching the data, what if the query itself is complex and takes time? Does Spark spread the work across the worker nodes, or do I have to fetch the table's data into Spark first and then run the SQL query to get the result? Which method do you suggest?
Using the above method, Spark opens a single JDBC connection to pull the table into the Spark environment.
If you want to push a predicate down to the database in the query, you can do it this way:
val pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
val df = spark.read.jdbc(url = jdbcUrl, table = pushdown_query, properties = connectionProperties)
display(df)
If you want to improve the performance, then you need to manage parallelism while reading.
You can provide split boundaries based on the dataset’s column values.
These options specify the parallelism of the read. They must all be specified if any of them is specified. lowerBound and upperBound decide the partition stride, but do not filter the rows in the table; Spark therefore partitions and returns all rows in the table.
The following example splits the table read across executors on the emp_no column using the columnName, lowerBound, upperBound, and numPartitions parameters.
val df = spark.read.jdbc(
  url = jdbcUrl,
  table = "employees",
  columnName = "emp_no",
  lowerBound = 1L,
  upperBound = 100000L,
  numPartitions = 100,
  connectionProperties = connectionProperties)
display(df)
For more details, use this link.

Accessing Hive Tables from Spark SQL when Data is Stored in Object Storage

I am using the Spark DataFrame writer to write data to internal Hive tables in Parquet format in IBM Cloud Object Storage.
My Hive metastore is in an HDP cluster and I am running the Spark job from that HDP cluster. The Spark job writes the data to IBM COS in Parquet format.
This is how I am starting the Spark session:
SparkSession session = SparkSession.builder().appName("ParquetReadWrite")
.config("hive.metastore.uris", "<thrift_url>")
.config("spark.sql.sources.bucketing.enabled", true)
.enableHiveSupport()
.master("yarn").getOrCreate();
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.api.key",credentials.get(ConnectionConstants.COS_APIKEY));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.iam.service.id",credentials.get(ConnectionConstants.COS_SERVICE_ID));
session.sparkContext().hadoopConfiguration().set("fs.cos.mpcos.endpoint",credentials.get(ConnectionConstants.COS_ENDPOINT));
The issue that I am facing is that when I partition the data and store it (via partitionBy), I am unable to access the data directly from Spark SQL:
spark.sql("select * from partitioned_table").show
To fetch the data from the partitioned table, I have to load the DataFrame and register it as a temp table and then query it.
The above issue does not occur when the table is not partitioned.
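That workaround looks roughly like this (sketched in PySpark for brevity; the COS path and view name are placeholders):
# Workaround sketch: read the Parquet files directly from object storage,
# register a temporary view, and query the view instead of the metastore table.
# The cos:// path below is a placeholder for the actual table location.
partitioned_df = spark.read.parquet("cos://mybucket.mpcos/warehouse/partitioned_table")
partitioned_df.createOrReplaceTempView("partitioned_table_tmp")
spark.sql("select * from partitioned_table_tmp").show()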
The code to write the data is this:
dfWithSchema.orderBy(sortKey).write()
.partitionBy("somekey")
.mode("append")
.format("parquet")
.option("path",PARQUET_PATH+tableName )
.saveAsTable(tableName);
Any idea why the direct query approach is not working for the partitioned tables in COS/Parquet?
To read the partitioned table (created by Spark), you need to give the absolute path of the table, as below.
selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
To filter it further, please try the below approach.
selected_Data.where(col("column_name")=='col_value').show()
This issue occurs when the property hive.metastore.try.direct.sql is set to true in the Hive metastore configuration and the Spark SQL query runs over a non-STRING partition column.
For Spark, it is recommended to create tables with partition columns of STRING type.
If you are getting the below error message while filtering the Hive partitioned table in Spark:
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
recreate your Hive partitioned table with the partition column datatype as string; then you will be able to access the data directly from Spark SQL.
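For illustration, a sketch of such a table definition (database, table, and column names are placeholders):
# Hypothetical example: the partition column is declared as STRING so that
# partition filtering from Spark SQL works against the metastore.
spark.sql("""
    CREATE TABLE IF NOT EXISTS mydb.partition_table (
        id    STRING,
        value STRING
    )
    PARTITIONED BY (somekey STRING)
    STORED AS PARQUET
""")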
Otherwise, you have to specify the absolute path of your HDFS location to get the data, in case your partition column has been defined as varchar:
selected_Data=spark.read.format("parquet").option("header","false").load("hdfs/path/loc.db/partition_table")
However, I was not able to understand why it differentiates between the varchar and string datatypes for a partition column.

Spark SQL - Options for deploying SQL queries on Spark Streams

I'm new to Spark and would like to run a Spark SQL query over Spark streams.
My current understanding is that I would need to define my SQL query in the code of my Spark job, as this snippet lifted from the Spark SQL home page shows:
spark.read.json("s3n://...")
  .registerTempTable("json")
results = spark.sql(
  """SELECT *
     FROM people
     JOIN json ...""")
What I want to do is define my query on its own somewhere, e.g. in a .sql file, and then deploy it over a Spark cluster.
Can anyone tell me if Spark currently has any support for this architecture, e.g. some API?
You can use Python's open to accomplish this:
with open('filepath/filename.sql') as fr:
    query = fr.read()
x = spark.sql(query)
x.show(5)
You could pass filename.sql as an argument while submitting your job and read it using sys.argv.
Please refer to this link for more help: Spark SQL question
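A rough sketch of that pattern (script and file names are placeholders):
# run_sql.py -- submit with: spark-submit run_sql.py path/to/query.sql
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RunSqlFile").getOrCreate()

sql_file = sys.argv[1]          # first argument after the script name
with open(sql_file) as fr:
    query = fr.read()

spark.sql(query).show(5)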

How to use different Hive metastore for saveAsTable?

I am using Spark SQL (Spark 1.6.1) with PySpark, and I have a requirement to load a table from one Hive metastore and write the resulting DataFrame into a different Hive metastore.
I am wondering how I can use two different metastores in one Spark SQL script.
Here is what my script looks like:
# Hive metastore 1
sc1 = SparkContext()
hiveContext1 = HiveContext(sc1)
hiveContext1.setConf("hive.metastore.warehouse.dir", "tmp/Metastore1")
#Hive metastore 2
sc2 = SparkContext()
hiveContext2 = HiveContext(sc2)
hiveContext2.setConf("hive.metastore.warehouse.dir", "tmp/Metastore2")
#Reading from a table present in metastore1
df_extract = hiveContext1.sql("select * from emp where emp_id =1")
# Need to write the result into a table in a different metastore
df_extract.saveAsTable('targetdbname.target_table',mode='append',path='maprfs:///abc/datapath...')
HotelsDotCom has developed an application (Waggle Dance) specifically for this: https://github.com/HotelsDotCom/waggle-dance. Using it as a proxy, you should be able to achieve what you're trying to do.
TL;DR It is not possible to use one Hive metastore (for some tables) and another (for other tables).
Since Spark SQL supports a single Hive metastore (in the SharedState) regardless of the number of SparkSessions, reading from and writing to different Hive metastores is technically impossible.
