Databricks Lakehouse JDBC and Docker - apache-spark

Pretty new to Databricks.
I've got a requirement to access data in the Lakehouse using a JDBC driver. This works fine.
I now want to stub the Lakehouse using a docker image for some tests I want to write. Is it possible to get a Databricks / spark docker image with a database in it? I would also want to bootstrap the database on startup to create a bunch of tables.

No. Databricks is not a database but a hosted service (PaaS). In theory you can use OSS Spark with the Thrift server started on it, but the connection strings and other functionality would be very different, so it makes no sense to spend time on it (imho). The real solution would depend on the type of tests that you want to do.
Regarding bootstrapping the database and creating a bunch of tables: just issue commands such as create database if not exists or create table if not exists when your application starts up (see the documentation for the exact syntax).
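As a rough sketch of that bootstrap step, assuming the application holds a DB-API style cursor from whichever JDBC/ODBC bridge it uses. The schema name, table names and column definitions here are hypothetical, and the exact DDL syntax should be checked against the documentation:

```python
def bootstrap_statements(schema, tables):
    """Build idempotent DDL: one CREATE DATABASE plus one CREATE TABLE per entry.

    `tables` maps a table name to its column definition string.
    """
    statements = [f"CREATE DATABASE IF NOT EXISTS {schema}"]
    for name, columns in tables.items():
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {schema}.{name} ({columns})"
        )
    return statements


def bootstrap(cursor, schema, tables):
    """Run the DDL on any object exposing a DB-API style execute(sql)."""
    for sql in bootstrap_statements(schema, tables):
        cursor.execute(sql)


# Hypothetical tables, for illustration only.
EXAMPLE_TABLES = {
    "customers": "id BIGINT, name STRING",
    "orders": "id BIGINT, customer_id BIGINT, amount DOUBLE",
}
```

Because every statement uses "if not exists", calling bootstrap(cursor, "test_db", EXAMPLE_TABLES) on each application start is safe to repeat.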

Related

Pyspark: How to setup multiple JDBC connections?

Use case: I have two databases, one for prod and one for dev. Prod uses an SAP JDBC driver and dev uses an Oracle JDBC driver, as they are based on different DBs. I have to fetch data from the prod DB, perform a few operations and save it in the dev DB for a few project needs.
Issue: Currently I am using these third-party drivers by setting "spark.driver.extraClassPath" in the Spark Context. But this takes only one argument, so I am able to connect to only one of the DBs at a time.
Is there any way I can configure two different JDBC class paths? If not, how can I approach this issue? Any guidance is much appreciated!
Solution:
Instead of defining the driver file path, providing the folder path loads all drivers in that folder. So, in my case, I placed both the SAP and Oracle JDBC drivers in the same folder and set it in the Spark Context configuration as shown in the snippet below.
.set("spark.driver.extraClassPath", r"<folder_path_jdbc_drivers>\*")
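A minimal sketch of building that value in Python; the folder and jar names are placeholders. The second helper relies on the setting being an ordinary Java classpath string, so several entries can also be joined with the platform path separator (";" on Windows, ":" elsewhere) instead of placing everything in one folder.

```python
import os


def classpath_for_folder(folder):
    """Classpath wildcard entry that picks up every jar in `folder`."""
    return os.path.join(folder, "*")


def classpath_for_jars(jar_paths):
    """Join individual jar paths with the platform's classpath separator."""
    return os.pathsep.join(jar_paths)


# Usage with PySpark (not executed here; requires pyspark to be installed):
#   from pyspark import SparkConf
#   conf = SparkConf().set(
#       "spark.driver.extraClassPath",
#       classpath_for_folder("<folder_path_jdbc_drivers>"),
#   )
```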

Spark job as a web service?

A peer of mine has created code that opens a RESTful API web service within an interactive Spark job. The intent of our company is to use his code as a means of extracting data from various data sources. He can get it to work on his machine with a local instance of Spark. He insists that this is a good idea and that it is my job as DevOps to implement it with Azure Databricks.
As I understand it, interactive jobs are for one-time analytics inquiries and for the development of non-interactive jobs to be run solely as ETL/ELT work between data sources. There is of course the added problem of determining the endpoint for the service binding within the Spark cluster.
But I'm new to Spark and I have scarcely delved into the mountain of documentation that exists for all the implementations of Spark. Is what he's trying to do a good idea? Is it even possible?
The web service would need to act as a Spark driver. Just as you'd run spark-shell, run some commands, and then use collect() methods to bring all the data back to the local environment, it all runs in a single JVM. It would request executors on a remote Spark cluster, then bring the data back over the network. Apache Livy is one existing implementation of a REST Spark submission server.
It can be done, but depending on the process it would be very asynchronous, and it is not suggested for large datasets, which is what Spark is meant for. Depending on the data that you need (e.g. if you rely heavily on SparkSQL), it'd be better to query a database directly.
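To give a feel for the Livy route, here is a hedged sketch that builds (but does not send) the HTTP request for running a snippet in an already created Livy session. The host name and session id are hypothetical; the path and JSON body follow my understanding of Livy's REST API and should be checked against its documentation.

```python
import json
import urllib.request


def livy_statement_request(base_url, session_id, code):
    """Build a POST that asks an existing Livy session to run `code`."""
    url = f"{base_url.rstrip('/')}/sessions/{session_id}/statements"
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and session id; 8998 is Livy's default port.
request = livy_statement_request(
    "http://livy.example.com:8998", 0, "spark.range(10).count()"
)
# Sending it would be: urllib.request.urlopen(request)
```

The response to such a call only acknowledges the statement; the result has to be polled for afterwards, which is the asynchronous behaviour mentioned above.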

Best Pyspark Testing: issue with databricks-connect

I'm currently using Databricks, and in order to test my Databricks code I'm using databricks-connect in VS Code. Since yesterday it has suddenly started behaving strangely: when I submit code from VS Code, databricks-connect uses the access of the person who created the Databricks cluster. He doesn't have adequate access to all the resources, so my test cases and code are failing due to access issues.
Some more input: I have a function which does an update operation on a Delta table, which means I can't use a normal Hive or temp table, as they don't support update operations.
I have tried Delta Lake locally, but that doesn't seem to work and is also not that convenient, as I wouldn't be able to access my ADLS location (unless I specifically change the configuration).
So, my question is: how are you doing Spark-specific testing? (I'm using pytest.)
I found that there isn't much material on testing Databricks code on the internet. Any help?

Creating catalog/schema/table in prestosql/presto container

I would like to use the prestosql/presto container for automated tests. For this purpose I want to be able to create a catalog/schema/table programmatically. Unfortunately, I didn't find an option for this via Docker environment variables. If I try to do it via the JDBC connector, I receive the following error: "This connector does not support creating tables"
How can I create schemas or tables using prestosql/presto container?
If you are writing tests in Java (as suggested by the JDBC tag), you can use the testcontainers library. It comes with a Presto module, which:
uses prestosql/presto container under the hood
comes with Presto memory connector pre-installed, so you can create schemas & tables there

Possibilities of Hadoop with MSSQL Reporting

I have been evaluating Hadoop on Azure HDInsight to find a big data solution for our reporting application. The key part of this technology evaluation is that I need to integrate with MSSQL Reporting Services, as that is what our application already uses. We are very short on developer resources, so the more I can make this into an engineering exercise the better. What I have tried so far:
Use an ODBC connection from MSSQL mapped to Hive on HDInsight.
Use an ODBC connection from MSSQL using HBase on HDInsight.
Use Spark SQL locally on the Azure HDInsight remote desktop.
What I have found is that HBase and Hive are far slower to use with our reports. For test data I used a table with 60k rows and found that the report on MSSQL ran in less than 10 seconds. I ran the query on the Hive query console and on the ODBC connection and found that it took over a minute to execute. Spark was faster (30 seconds), but there is no way to connect to it externally since ports cannot be opened on the HDInsight cluster.
Big data and Hadoop are all new to me. My question is: am I looking for Hadoop to do something it is not designed to do, and are there ways to make this faster? I have considered caching results and periodically refreshing them, but it sounds like a management nightmare. Kylin looks promising, but we are pretty married to Windows Azure, so I am not sure that is a viable solution.
Look at this documentation on optimizing Hive queries: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-optimize-hive-query/
Specifically, look at ORC and using Tez. I would create a cluster that has Tez on by default and then store your data in ORC format. Your queries should be much more performant then.
If going through Spark is fast enough, you should consider using the Microsoft Spark ODBC driver. I am using it, and the performance is not comparable to what you'll get with MSSQL, another RDBMS or something like ElasticSearch, but it does work pretty reliably.
