I would like to use the prestosql/presto container for automated tests. For this purpose I want to be able to programmatically create a catalog/schema/table. Unfortunately, I didn't find an option to do this via Docker environment variables. If I try to do it via the JDBC connector, I receive the following error: "This connector does not support creating tables".
How can I create schemas or tables using prestosql/presto container?
If you are writing tests in Java (as suggested by the JDBC tag), you can use the Testcontainers library. It comes with a Presto module, which:
- uses the prestosql/presto container under the hood
- comes with the Presto memory connector pre-installed, so you can create schemas & tables there
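For example, a minimal sketch of a test against that container (the image tag, schema, and table names here are just illustrative, not part of the original answer):

    import org.testcontainers.containers.PrestoContainer;

    import java.sql.Connection;
    import java.sql.Statement;

    public class PrestoMemorySketch {
        public static void main(String[] args) throws Exception {
            // Image tag is an example; pin whichever version you test against.
            try (PrestoContainer<?> presto = new PrestoContainer<>("prestosql/presto:348")) {
                presto.start();
                try (Connection connection = presto.createConnection("");
                     Statement statement = connection.createStatement()) {
                    // The memory connector inside the container accepts DDL,
                    // unlike the connector the question was hitting.
                    statement.execute("CREATE SCHEMA memory.test");
                    statement.execute("CREATE TABLE memory.test.events (id bigint, name varchar)");
                    statement.execute("INSERT INTO memory.test.events VALUES (1, 'hello')");
                }
            }
        }
    }

The same idea works from JUnit by managing the container as a @Testcontainers/@Container field instead of a try-with-resources block.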
Use case: I have two databases, one for prod and one for dev. The prod one uses an SAP JDBC driver and the dev one uses an Oracle JDBC driver, as they are based on different databases. I have to fetch data from the prod DB, perform a few operations, and save it in the dev DB for a few project needs.
Issue: Currently I am using these third-party drivers by setting "spark.driver.extraClassPath" in the Spark context. But this takes only one argument, so I am able to connect to only one of the DBs at a time.
Is there any way I can make two different JDBC classpath configurations? If not, how can I approach this issue? Any guidance is much appreciated!
Solution:
Instead of defining the driver file path, providing the folder path loads all drivers in that folder. So, in my case, I placed both the SAP and Oracle JDBC drivers in the same folder and referenced it in the Spark context configuration as shown in the snippet below.
.set("spark.driver.extraClassPath", r"<folder_path_jdbc_drivers>\*")
A peer of mine has created code that opens a RESTful API web service within an interactive Spark job. The intent of our company is to use his code as a means of extracting data from various data sources. He can get it to work on his machine with a local instance of Spark. He insists that this is a good idea, and it is my job as DevOps to implement it with Azure Databricks.
As I understand it, interactive jobs are for one-time analytics inquiries and for the development of non-interactive jobs to be run solely as ETL/ELT work between data sources. There is, of course, the added problem of determining the endpoint for the service binding within the Spark cluster.
But I'm new to Spark and have scarcely delved into the mountain of documentation that exists for all the implementations of Spark. Is what he's trying to do a good idea? Is it even possible?
The web service would need to act as a Spark driver. Just as you'd run spark-shell, run some commands, and then use collect() to bring all the data back to be shown in the local environment, that all runs in a single JVM. It would submit executors to a remote Spark cluster, then bring the data back over the network. Apache Livy is one existing implementation of a REST Spark submission server.
It can be done, but depending on the process it would be very asynchronous, and it is not suggested for large datasets, which is what Spark is meant for. Depending on the data that you need (e.g. if you are heavily using Spark SQL), it'd be better to query a database directly.
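For reference, a rough sketch of what handing the work off to Livy looks like over REST (the Livy host, jar path, and class name are placeholders):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class LivySubmitSketch {
        public static void main(String[] args) throws Exception {
            // POST /batches asks Livy to spark-submit the given jar on the cluster,
            // so the driver runs remotely instead of inside the web service's JVM.
            String body = "{ \"file\": \"hdfs:///jobs/extract-job.jar\", "
                    + "\"className\": \"com.example.ExtractJob\" }";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://livy-host:8998/batches"))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            // The response includes a batch id that can be polled with GET /batches/{id}.
            System.out.println(response.body());
        }
    }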
Pretty new to Databricks.
I've got a requirement to access data in the Lakehouse using a JDBC driver. This works fine.
I now want to stub the Lakehouse using a Docker image for some tests I want to write. Is it possible to get a Databricks/Spark Docker image with a database in it? I would also want to bootstrap the database on startup to create a bunch of tables.
No - Databricks is not a database but a hosted service (PaaS). Theoretically you can use OSS Spark with a Thrift server started on it, but the connection strings and other functionality would be very different, so it makes no sense to spend time on it (imho). The real solution would depend on the type of tests that you want to do.
Regarding bootstrapping the database & creating a bunch of tables - just issue commands like create database if not exists or create table if not exists when your application starts up (see the documentation for the exact syntax).
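A minimal sketch of such a bootstrap over the JDBC connection the tests already use (the URL, database, table, and column names are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BootstrapTables {
        public static void main(String[] args) throws Exception {
            // Placeholder - reuse whatever JDBC URL already works for you.
            String jdbcUrl = "<your-jdbc-url>";

            try (Connection connection = DriverManager.getConnection(jdbcUrl);
                 Statement statement = connection.createStatement()) {
                // Idempotent DDL, safe to run on every application start.
                statement.execute("CREATE DATABASE IF NOT EXISTS testdb");
                statement.execute("CREATE TABLE IF NOT EXISTS testdb.customers ("
                        + "id BIGINT, name STRING, created_at TIMESTAMP)");
            }
        }
    }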
Using Azure Databricks.
I have petastorm==0.11.2 and databricks-connect==9.1.0
My databricks-connect session seems to be working: I'm able to read data in from my remote workspace. But when I use petastorm to create a Spark converter object, it says it is unable to infer a schema, even though if I take the object I'm passing in and check its .schema attribute, it shows me a schema just fine.
The exact same code works within the Databricks workspace in the notebooks, but it doesn't work when I'm on a separate VM using DBConnect to read in the data.
I think the issue is around setting this configuration: SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF. When in the Databricks workspace, using the value 'file:///tmp/petastorm/cache/' works fine. When using databricks-connect, it supposedly builds a Spark context that's linked to the cluster and otherwise behaves fine for read and write paths.
Any ideas?
I am using presto version 179 and I need to manually create a database.properties file in /etc/presto/catalog through the CLI.
Can I do the same process from the GUI of presto?
Presto's built-in web interface does not provide any configuration capabilities.
Usually, such things are handled as part of deployment/configuration management on a cluster. Thus, the configuration is provided by some external means, just as the Presto installation itself is.
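For reference, a catalog file that such tooling drops into /etc/presto/catalog looks roughly like this (the connector choice and connection details below are illustrative, not prescribed by the answer):

    # /etc/presto/catalog/database.properties
    connector.name=mysql
    connection-url=jdbc:mysql://db-host:3306
    connection-user=presto
    connection-password=secret

Presto derives the catalog name from the file name (here "database") and loads it at server startup, so adding a new catalog file requires a restart.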