I'm using Airflow to schedule a Spark job that reads a conf.properties file.
I want to edit this file from the Airflow UI rather than from the server's CLI.
How can I do this?
The Airflow webserver doesn't support editing files in its UI. However, it lets you add your own plugins and customize the UI by adding flask_appbuilder views (here is the doc).
You can also use an unofficial open-source plugin to do that (e.g. airflow_code_editor).
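Whichever route you take, the view or plugin ultimately has to rewrite conf.properties on the server. Below is a minimal sketch of that part only, in plain Python. The function name update_property is hypothetical, it handles only simple key=value lines (no "key: value" form, no line continuations), and wiring it into a flask_appbuilder view is not shown.

```python
from pathlib import Path


def update_property(path, key, value):
    """Set key=value in a .properties file, keeping comments and other lines."""
    file = Path(path)
    lines = file.read_text(encoding="utf-8").splitlines() if file.exists() else []
    updated = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Skip blank lines and comments ('#' or '!' start a comment).
        if not stripped or stripped[0] in "#!":
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            updated = True
    if not updated:
        # Key not present yet: append it at the end of the file.
        lines.append(f"{key}={value}")
    file.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

A view handler would call something like update_property("/path/to/conf.properties", "spark.executor.memory", "4g") after validating the submitted form; the path and key here are placeholders.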
We are using Databricks to generate ETL scripts. One step requires us to upload small CSV files into a Repos folder. I can do this manually using the import window in the Repos GUI. However, I would like to do this programmatically using the Databricks CLI. Is this possible? I have tried using the Workspace API, but this only works for source code files.
Unfortunately, it's not possible right now, because there is no API for this that databricks-cli could use. But you can add and commit the files to the Git repository, and then use databricks repos update to pull them into the workspace.
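As a rough sketch of that workaround, the commands could be assembled like this. The helper names, file names and repo path are placeholders, and the databricks repos update flags follow the legacy databricks-cli syntax as I remember it, so check databricks repos update --help for your version before relying on it.

```python
import shlex
import subprocess


def build_sync_commands(files, message, repo_path, branch="main"):
    """Return the git and databricks-cli commands, in order, as argument lists."""
    return [
        ["git", "add", *files],
        ["git", "commit", "-m", message],
        ["git", "push", "origin", branch],
        # Legacy databricks-cli flags (assumption: verify against your CLI version).
        ["databricks", "repos", "update", "--path", repo_path, "--branch", branch],
    ]


def run_all(commands, dry_run=True):
    """Print the commands, or execute them and stop at the first failure."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling run_all(build_sync_commands(["data/lookup.csv"], "Add lookup csv", "/Repos/me/etl")) only prints the commands; pass dry_run=False to actually run them from inside the local clone.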
I'm using Azure Databricks for data processing, with notebooks and pipelines.
I'm not satisfied with my current workflow:
The notebook used in production can't be modified without breaking production. When I want to develop an update, I duplicate the notebook, change the source code until I'm satisfied, then I replace the production notebook with my new notebook.
My browser is not an IDE! I can't easily go to a function definition. I have lots of notebooks; if I want to modify or even just see the documentation of a function, I need to switch to the notebook where this function is defined.
Is there a way to do efficient and systematic testing?
Git integration is very simple, but this is not my main concern.
Great question. Definitely don't modify your production code in place.
One recommended pattern is to keep separate folders in your workspace for dev-staging-prod. Do your dev work and then run tests in staging before finally promoting to production.
You can use the Databricks CLI to pull and push a notebook from one folder to another without breaking existing code. Going one step further, you can incorporate this pattern with git to sync with version control. In either case, the CLI gives you programmatic access to the workspace and that should make it easier to update code for production jobs.
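For illustration, promoting a notebook from one folder to another could look roughly like the sketch below. The function name and workspace paths are made up, and the flags are the legacy databricks-cli ones (databricks workspace export / import), so treat them as an assumption to check against your CLI version.

```python
import shlex


def build_promote_commands(src_notebook, dst_notebook,
                           local_file="notebook.py", language="PYTHON"):
    """Return the two CLI calls: export from one folder, import into another."""
    return [
        # Pull the notebook source from the workspace to a local file.
        ["databricks", "workspace", "export", "--format", "SOURCE",
         "--overwrite", src_notebook, local_file],
        # Push the local file to the target folder, replacing the old version.
        ["databricks", "workspace", "import", "--language", language,
         "--format", "SOURCE", "--overwrite", local_file, dst_notebook],
    ]


if __name__ == "__main__":
    for cmd in build_promote_commands("/Staging/etl/load", "/Prod/etl/load"):
        print(shlex.join(cmd))
```

The local file in the middle is also a natural place to commit the notebook to git before importing it into the next environment.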
Regarding your second point about IDEs - Databricks offers Databricks Connect, which lets you use your IDE while running commands on a cluster. Based on your pain points I think this is a great solution for you, as it will give you more visibility into the functions you have defined and so on. You can also write and run your unit tests this way.
Once you have your scripts ready to go, you can always import them into the workspace as notebooks and run them as jobs. Also know that you can run .py scripts as a job using the REST API.
I personally prefer to package my code, and copy the *.whl package to DBFS, where I can install the tested package and import it.
Edit: To be more explicit.
The notebook used in production can't be modified without breaking production. When I want to develop an update, I duplicate the notebook, change the source code until I'm satisfied, then I replace the production notebook with my new notebook.
This can be solved either by having separate environments (DEV/TST/PRD) or by having versioned packages that can be modified in isolation. I'll clarify later on.
My browser is not an IDE! I can't easily go to a function definition. I have lots of notebooks, if I want to modify or even just see the documentation of a function, I need to switch to the notebook where this function is defined.
Is there a way to do efficient and systematic testing?
Yes. Using the versioned packages method I mentioned in combination with databricks-connect, you can use your IDE, implement tests, and have proper git integration.
Git integration is very simple, but this is not my main concern.
Built-in git integration is actually very poor when working in bigger teams. You can't develop in the same notebook simultaneously, as there's a flat and linear accumulation of changes that are shared with your colleagues. Besides that, you have to link and unlink repositories, which is prone to human error: notebooks end up synchronized in the wrong folders, and runs break because notebooks can't be imported. I advise you to also use my packaging solution.
The packaging solution works as follows (Reference):
On your desktop, install pyspark
Download some anonymized data to work with
Develop your code with small bits of data, writing unit tests
When ready to test on big data, uninstall pyspark, install databricks-connect
When performance and integration are sufficient, push code to your remote repo
Create a build pipeline that runs automated tests, and builds the versioned package
Create a release pipeline that copies the versioned package to DBFS
In a "runner notebook" accept "process_date" and "data folder/filepath" as arguments, and import modules from your versioned package
Pass the arguments to your module to run your tested code
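To make the last two steps a bit more concrete, here is a small sketch of what the package side could expose to the runner notebook. Everything in it is illustrative: the names RunConfig and build_config, the YYYY-MM-DD date format and the year/month/day folder layout are assumptions, not part of the setup described above.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class RunConfig:
    process_date: date
    input_path: str


def build_config(process_date: str, data_folder: str) -> RunConfig:
    """Validate the two runner-notebook arguments and derive the input path."""
    # Raises ValueError if the date is not in YYYY-MM-DD format.
    parsed = datetime.strptime(process_date, "%Y-%m-%d").date()
    # Assumed layout: <data_folder>/<year>/<month>/<day>
    input_path = f"{data_folder.rstrip('/')}/{parsed:%Y/%m/%d}"
    return RunConfig(parsed, input_path)
```

In the runner notebook, the two arguments would typically come from widgets, e.g. build_config(dbutils.widgets.get("process_date"), dbutils.widgets.get("data_folder")), and the result is passed on to the tested module. Because build_config is plain Python, it can be unit tested on your desktop without a cluster.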
The way we are doing it:
-Integrate the Dev notebooks with Azure DevOps.
-Create custom Build and Deployment tasks for Notebook, Jobs, package and cluster deployments. This is fairly easy to do with the Databricks REST API
https://docs.databricks.com/dev-tools/api/latest/index.html
-Create Release pipeline for Test, Staging and Production deployments.
-Deploy on Test and test.
-Deploy on Staging and test.
-Deploy on production
Hope this can help.
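As an illustration of what such a custom deployment task might do, the sketch below builds a notebook import request for the Workspace API (POST /api/2.0/workspace/import) using only the standard library. The function name, host, token and paths are placeholders, and this is not necessarily how the tasks above are implemented.

```python
import base64
import json
import urllib.request


def build_import_request(host, token, workspace_path, source_code, language="PYTHON"):
    """Build (but do not send) a Workspace API request that imports a notebook."""
    payload = {
        "path": workspace_path,
        "format": "SOURCE",
        "language": language,
        "overwrite": True,
        # The API expects the file content base64-encoded.
        "content": base64.b64encode(source_code.encode("utf-8")).decode("ascii"),
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/workspace/import",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

A release stage would then send it with urllib.request.urlopen(request), using the workspace URL and a token stored as a secret pipeline variable for that environment.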
Is it possible for a pipeline to have multiple triggers in one YAML file that executes different jobs per trigger?
In our pipeline, we pack each project in the solution and push it as a NuGet package to our own Azure DevOps Artifacts feed, and we want the packing and pushing to depend on the project. I saw that it is possible to specify the branch and path in the trigger, but according to this you can only have one trigger. However, the author only implied that in the question, and the documentation doesn't explicitly state it.
Right now my option is to configure a separate pipeline with its own YAML file per project, but I want to ask here to confirm whether this is possible or not.
I agree with Jessehouwing: you can add multiple triggers. You can use conditionals on tasks, jobs, stages and environments to only run in specific cases.
https://learn.microsoft.com/en-us/azure/devops/pipelines/yaml-schema?view=azure-devops&tabs=schema#triggers
https://learn.microsoft.com/en-us/azure/devops/pipelines/process/conditions?tabs=yaml&view=azure-devops
Thanks for the input. I studied the docs, but it's not possible to achieve what I wanted with just the built-in tasks for Azure DevOps. I had to write a script that does it and assigns true or false values to the conditionals.
The exact answer I was looking for was in this post
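For anyone landing here, a script of that kind could look roughly like the sketch below. It is not the script from the linked post: the function names, the pack_ variable prefix and the project folders are made up. It checks which project folders contain changed files and emits Azure Pipelines task.setvariable logging commands that later jobs can use in their conditions.

```python
import subprocess


def git_changed_files(base="HEAD~1", head="HEAD"):
    """List files changed between two commits (needs a git checkout with history)."""
    out = subprocess.check_output(["git", "diff", "--name-only", base, head], text=True)
    return out.splitlines()


def changed_projects(changed_files, project_dirs):
    """Map each project name to True if any changed file lives under its directory."""
    result = {}
    for name, directory in project_dirs.items():
        prefix = directory.rstrip("/") + "/"
        result[name] = any(f.startswith(prefix) for f in changed_files)
    return result


def to_logging_commands(flags):
    """Format one 'task.setvariable' logging command per project flag."""
    return [
        f"##vso[task.setvariable variable=pack_{name};isOutput=true]{str(value).lower()}"
        for name, value in flags.items()
    ]


if __name__ == "__main__":
    projects = {"ProjectA": "src/ProjectA", "ProjectB": "src/ProjectB"}
    for line in to_logging_commands(changed_projects(git_changed_files(), projects)):
        print(line)  # the pipeline agent picks these lines up from stdout
```

A later job can then use a condition along the lines of eq(dependencies.Detect.outputs['detect.pack_ProjectA'], 'true'), where Detect and detect are whatever you named the job and the step that ran the script.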
There are tons of resources online on how to replace JSON configuration files in a release pipeline like this one. I configured this. It works. However, we have multiple integration tests which reach the database too. These tests are run during build time. I haven't seen any option yet to replace config values in the build pipeline. Does it exist? Or do I really have to use this custom task (see screenshot below)?
Microsoft recently added an out-of-the-box task for this, called File Transform. It's currently in preview, but it works really well! I haven't had any issues with it, and you configure it the same way as in the release pipeline. Would recommend this any day!
Below you can see my configuration.
There is no out-of-the-box task that only replaces tokens/values in files (in the release pipeline, too, the task is Azure App Service Deploy, which does more than replace JSON configuration).
You need to use an external extension from here or write a PowerShell script for that.
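If you go the script route, the logic is small. Here is a minimal sketch, written in Python rather than PowerShell, that overrides values in a JSON config file by dotted key. The function names and the key names in the usage note are invented for the example.

```python
import json


def set_json_value(config, dotted_key, value):
    """Set a nested value such as 'ConnectionStrings.Default' in a parsed JSON dict."""
    node = config
    *parents, leaf = dotted_key.split(".")
    for part in parents:
        # Create intermediate objects when they don't exist yet.
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


def transform_file(path, overrides):
    """Apply all overrides to the JSON file at 'path' and write it back in place."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    for key, value in overrides.items():
        set_json_value(config, key, value)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
```

In a build pipeline you would run it in a script step before the integration tests, for example transform_file("appsettings.json", {"ConnectionStrings.Default": connection_string}), with the value coming from a secret pipeline variable exposed to the step as an environment variable.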