I am using the dbt CLI to regularly update data via dbt run. However, this materializes several tables, and can take 20+ hours to do a full refresh.
I am currently running this from my PC/cloud VM, but I don't want to keep my PC on / VM running just to run the dbt CLI. Moreover, I've been burned before trying to do this (a brief Wi-Fi issue interrupted a dbt operation 10 hours into a 12-hour table materialization).
Are there any good patterns for this? Note that I'm using SQL Server, which is not supported by dbt Cloud.
I've considered:
Setting up Airflow / Prefect
Having a small VM just for dbt to run on
Moving to a faster database (e.g. from Azure SQL to Azure Synapse)
Any ideas?
I would agree here with Branden: throwing resources at the problem should be the last resort. The first thing you should do is try optimizing your SQL queries. Once the queries are optimized, the time a full refresh takes will depend on data volume. If the volume is high, you should be doing incremental runs rather than full refreshes. You can schedule incremental runs using something like cron or Airflow.
Another thing to note: you don't need to do a full dbt run if you only want to run selected models. You can always do dbt run -m +model
+model -> run the model with all upstream dependencies
model+ -> run the model with all downstream dependencies
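For example, a minimal Airflow sketch of a nightly incremental run with model selection (assuming Airflow 2.x; the DAG id, schedule, model name, and project path are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical location of the dbt project on the scheduler/worker.
DBT_PROJECT_DIR = "/opt/dbt/my_project"

with DAG(
    dag_id="dbt_incremental_run",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 2 * * *",  # nightly at 02:00
    catchup=False,
) as dag:
    # Incremental run of one model plus all of its upstream dependencies.
    dbt_run = BashOperator(
        task_id="dbt_run_incremental",
        bash_command=(
            f"cd {DBT_PROJECT_DIR} && "
            "dbt run -m +my_model --profiles-dir ."
        ),
    )
```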
Another aspect: since you're using SQL Server, which is a row store (more suited to ETL), you may also need some dimensional modeling. dbt is the T of ELT, where data is already loaded into a powerful column-store warehouse (like Snowflake/Redshift) and dimensional modeling isn't needed because queries already leverage the columnar storage. That doesn't mean dbt can't work with row stores, but dimensional modeling may be needed.
Second, you can always have a small VM, or run it on something like ECS Fargate, which is serverless, so you're charged only while dbt runs.
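If the Fargate route appeals, a rough boto3 sketch of kicking off a one-shot dbt task (the cluster, task definition, container name, and network IDs are all hypothetical placeholders):

```python
import boto3

# Assumes a "dbt-runner" task definition whose container image bundles the
# dbt project; all identifiers below are placeholders.
ecs = boto3.client("ecs", region_name="us-east-1")

response = ecs.run_task(
    cluster="dbt-cluster",
    launchType="FARGATE",
    taskDefinition="dbt-runner",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "ENABLED",
        }
    },
    # Override the container command to choose the dbt invocation per run.
    overrides={
        "containerOverrides": [
            {"name": "dbt", "command": ["dbt", "run", "-m", "+my_model"]}
        ]
    },
)
print(response["tasks"][0]["taskArn"])
```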
Finally, if nothing else works, you should consider moving to something like Synapse, which can throw more compute-intensive resources at the queries to run them faster.
Related
I have a pyspark batch job scheduled on YARN. There is now a requirement to put the logic of the spark job into a web service.
I really don't want there to be 2 copies of the same code, and therefore would like to somehow reuse the spark code inside the service, only replacing the IO parts.
The expected size of the workloads per request is small so I don't want to complicate the service by turning it into a distributed application. I would like instead to run the spark code in local mode inside the service. How do I do that? Is that even a good idea? Are there better alternatives?
A peer of mine has created code that opens a RESTful API web service within an interactive Spark job. The intent of our company is to use his code as a means of extracting data from various data sources. He can get it to work on his machine with a local instance of Spark. He insists that this is a good idea, and it is my job as DevOps to implement it with Azure Databricks.
As I understand it, interactive jobs are for one-time analytics inquiries and for the development of non-interactive jobs to be run solely as ETL/ELT work between data sources. There is of course the added problem of determining the endpoint for the service binding within the Spark cluster.
But I'm new to spark and I have scarcely delved into the mountain of documentation that exists for all the implementations of spark. Is what he's trying to do a good idea? Is it even possible?
The web service would need to act as the Spark driver. Just like when you run spark-shell, run some commands, and then use collect() to bring all the data back to be shown in the local environment, that all runs in a single JVM. It would request executors from a remote Spark cluster, then bring the data back over the network. Apache Livy is one existing implementation of a REST Spark submission server.
It can be done, but depending on the process it would be very asynchronous, and it is not suggested for large datasets, which is what Spark is meant for. Depending on the data you need (e.g. if you're heavily using Spark SQL), it may be better to query a database directly.
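As a sketch of that driver/service split, the web service could submit work to the cluster through Livy's REST API rather than embedding the driver itself (the Livy host, jar path, and class name below are hypothetical):

```python
import requests

LIVY_URL = "http://livy-host:8998"  # placeholder Livy endpoint

# Submit a batch job to the remote cluster through Livy.
resp = requests.post(
    f"{LIVY_URL}/batches",
    json={
        "file": "hdfs:///jobs/my-spark-job.jar",  # application jar on HDFS
        "className": "com.example.MyJob",
        "args": ["2024-01-01"],
    },
)
batch_id = resp.json()["id"]

# Poll the batch state instead of blocking the web request.
state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
print(batch_id, state)
```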
Pretty new to Databricks.
I've got a requirement to access data in the Lakehouse using a JDBC driver. This works fine.
I now want to stub the Lakehouse using a docker image for some tests I want to write. Is it possible to get a Databricks / spark docker image with a database in it? I would also want to bootstrap the database on startup to create a bunch of tables.
No - Databricks is not a database but a hosted service (PaaS). Theoretically you can use OSS Spark with a Thrift server started on it, but the connection strings and other functionality would be very different, so it makes no sense to spend time on it (imho). The real solution would depend on the type of tests that you want to do.
Regarding bootstrapping the database and creating a bunch of tables - just issue commands like create database if not exists or create table if not exists when your application starts up (see the documentation for exact syntax).
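A minimal sketch of that bootstrap step, using a local OSS Spark session with Hive support as the stand-in (the database and table names are placeholders):

```python
from pyspark.sql import SparkSession

# Local Spark session acting as a rough stand-in for the Lakehouse in tests.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("lakehouse-stub")
    .enableHiveSupport()
    .getOrCreate()
)

# Bootstrap the schema on startup.
spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
spark.sql(
    """
    CREATE TABLE IF NOT EXISTS test_db.orders (
        order_id BIGINT,
        amount   DOUBLE
    )
    """
)
```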
I'm having some difficulty figuring out how to use Spark's machine learning capabilities in a real-life production environment.
What I want to do is the following:
Develop a new ML model using notebooks
Serve the learned model via a REST API (something like POST /api/v1/mymodel/predict)
Let's say the ML training process is handled by a notebook, and once the model requirements are fulfilled the model is saved to HDFS, to be loaded later by a Spark application.
I know I could write a long-running Spark application that exposes the API and run it on my Spark cluster, but I don't think this is really a scalable approach, because even if the data transformations and the ML functions ran on the worker nodes, the HTTP/API-related code would still run on one node, the one on which spark-submit is invoked (correct me if I'm wrong).
Another approach is to use the same long-running application, but in a local/standalone cluster. I could deploy the application as many times as I want and put a load balancer in front of it. With this approach the HTTP/API part is handled fine, but the Spark part is not using the cluster's capabilities at all (this may not be a problem, given that it should only perform a single prediction per request).
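A rough sketch of that second approach, assuming the trained pipeline was saved with PipelineModel.save() and Flask as the web framework (the model path, endpoint, and columns are placeholders):

```python
from flask import Flask, jsonify, request
from pyspark.ml import PipelineModel
from pyspark.sql import Row, SparkSession

app = Flask(__name__)

# Local-mode Spark session inside the service; no cluster involved.
spark = (
    SparkSession.builder
    .master("local[2]")
    .appName("model-serving")
    .getOrCreate()
)

# Hypothetical path where the notebook saved the trained pipeline; assumes
# the pipeline includes feature assembly and outputs a "prediction" column.
model = PipelineModel.load("hdfs:///models/mymodel")


@app.route("/api/v1/mymodel/predict", methods=["POST"])
def predict():
    # Turn the JSON payload into a one-row DataFrame and score it.
    df = spark.createDataFrame([Row(**request.get_json())])
    prediction = model.transform(df).select("prediction").first()[0]
    return jsonify({"prediction": prediction})


if __name__ == "__main__":
    app.run(port=8080)
```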
There is a third approach which uses SparkLauncher, wrapping the Spark job in a separate jar, but I don't really like flying jars, and it is difficult to retrieve the result of the prediction (a queue maybe, or HDFS).
So basically the question is: what is the best approach to consume Spark's ML models through a REST API?
Thank You
You have three options:
Trigger a batch ML job via a Spark API server (spark-jobserver) upon client request
Trigger a batch ML job via a scheduler (Airflow), write the output to a DB, and expose the DB via REST to the client
Keep a Structured Streaming / recursive function running to scan the input data source, update/append the DB continuously, and expose the DB via REST to the client
If you have a single prediction per request and your input data is constantly updated, I would suggest option 3, which transforms data in near real time at all times so the client always has access to the output. You can notify the client when new data has been processed by sending a notification via REST or SNS. You could keep a pretty small Spark cluster that handles the data ingest, and scale the REST service and DB with request/data volume (behind a load balancer).
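A minimal sketch of option 3, assuming a saved Spark ML PipelineModel and an input directory of JSON files (paths, schema, and sink format are placeholders; a real DB sink would typically go through foreachBatch):

```python
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StructField, StructType

spark = SparkSession.builder.appName("streaming-scoring").getOrCreate()

# Hypothetical input schema and locations.
schema = StructType([
    StructField("feature_a", DoubleType()),
    StructField("feature_b", DoubleType()),
])
model = PipelineModel.load("hdfs:///models/mymodel")

# Continuously scan the input source, score new records, and append results.
stream = spark.readStream.schema(schema).json("hdfs:///data/incoming")
scored = model.transform(stream).select("feature_a", "feature_b", "prediction")

query = (
    scored.writeStream
    .outputMode("append")
    .format("parquet")  # stand-in sink; swap for a DB via foreachBatch
    .option("path", "hdfs:///data/predictions")
    .option("checkpointLocation", "hdfs:///data/checkpoints/predictions")
    .start()
)
query.awaitTermination()
```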
If you anticipate rare requests, where the data source is updated periodically (let's say once a day), option 1 or 2 will be suitable, as you can launch a bigger cluster and shut it down when it's done.
Hope it helps.
The problem is that you don't want to keep your Spark cluster running and deploy your REST API inside it for prediction, as that is slow.
So, to achieve real-time prediction with low latency, here are a couple of solutions.
What we are doing is training the model, exporting it, and using the model outside Spark to do the prediction.
You can export the model as a PMML file if the ML algorithm you used is supported by the PMML standard. Spark ML models can be exported to PMML using the JPMML library (jpmml-sparkml). You can then create your REST API and use the JPMML Evaluator to predict using your Spark ML models.
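On the PySpark side, the export step might look something like this, using the pyspark2pmml wrapper around jpmml-sparkml (the training data, pipeline, and output path are placeholders, and the jpmml-sparkml jar must be on the Spark classpath):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from pyspark2pmml import PMMLBuilder

spark = SparkSession.builder.appName("pmml-export").getOrCreate()

# Hypothetical training data with two features and a label column.
df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 1.0), (3.0, 4.0, 1.0)],
    ["feature_a", "feature_b", "label"],
)

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=["feature_a", "feature_b"], outputCol="features"),
    LogisticRegression(labelCol="label", featuresCol="features"),
])
model = pipeline.fit(df)

# Export the fitted pipeline to a PMML file for serving outside Spark.
PMMLBuilder(spark.sparkContext, df, model).buildFile("/tmp/mymodel.pmml")
```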
MLeap: MLeap is a common serialization format and execution engine for machine learning pipelines. It supports Spark, scikit-learn, and TensorFlow for training pipelines and exporting them to an MLeap Bundle. Serialized pipelines (bundles) can be deserialized back into Spark for batch-mode scoring, or into the MLeap runtime to power real-time API services. It supports multiple platforms, though I have only used it for Spark ML models and it works really well.
If I have data in a Hadoop cluster or a SQL elastic DB, is Azure ML bringing that data onto its ML servers, or leaving it on Hadoop/SQL and running the analysis there?
Currently, Azure Machine Learning will bring that data onto ML servers.
Execution for each of the modules is done on AzureML's backend servers. But you can connect to the databases through either "Reader" modules, or, say, Python code using ODBC to issue queries to a database and get the results as the return value, in which case the query is run on the data servers and the results are sent to AzureML. This is useful if you want to do data aggregation queries in Hive or SQL to reduce the size of your dataset before bringing it into AzureML.
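As an illustration of the ODBC path, a hedged sketch of pushing an aggregation down to the database and returning only the summarized result (the server, database, credentials, and table are placeholders):

```python
import pandas as pd
import pyodbc

# Hypothetical connection details for the source database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=mydb;UID=myuser;PWD=mypassword"
)

# The GROUP BY runs on the database server; only the small summarized
# result set comes back to the AzureML side.
query = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM dbo.transactions
    GROUP BY customer_id
"""
summary = pd.read_sql(query, conn)
conn.close()
```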