What is a good Databricks workflow

What is a good Databricks workflow - azure

I'm using Azure Databricks for data processing, with notebooks and pipeline.
I'm not satisfied with my current workflow:
The notebook used in production can't be modified without breaking the production. When I want to develop an update, I duplicate the notebook, change the source code until I'm satisfied, then I replace the production notebook with my new notebook.
My browser is not an IDE! I can't easily go to a function definition. I have lots of notebooks, if I want to modify or even just see the documentation of a function, I need to switch to the notebook where this function is defined.
Is there a way to do efficient and systematic testing ?
Git integration is very simple, but this is not my main concern.

Great question. Definitely dont modify your production code in place.
One recommended pattern is to keep separate folders in your workspace for dev-staging-prod. Do your dev work and then run tests in staging before finally promoting to production.
You can use the Databricks CLI to pull and push a notebook from one folder to another without breaking existing code. Going one step further, you can incorporate this pattern with git to sync with version control. In either case, the CLI gives you programmatic access to the workspace and that should make it easier to update code for production jobs.
Regarding your second point about IDEs - Databricks offers Databricks Connect, which let's you use your IDE while running commands on a cluster. Based on your pain points I think this is a great solution for you, as it will give your more visibility into the functions you have defined and so on. You can also write and run your unit tests this way.
Once you have your scripts ready to go you can always import them into the workspace as a notebook and run it as a job. Also know that you can run .py scripts as a job using the REST API.

I personally prefer to package my code, and copy the *.whl package to DBFS, where I can install the tested package and import it.
Edit: To be more explicit.
The notebook used in production can't be modified without breaking the production. When I want to develop an update, I duplicate the notebook, change the source code until I'm satisfied, then I replace the production notebook with my new notebook.
This can be solved by either having separate environments DEV/TST/PRD. Or having versioned packages that can be modified in isolation. I'll clarify later on.
My browser is not an IDE! I can't easily go to a function definition. I have lots of notebooks, if I want to modify or even just see the documentation of a function, I need to switch to the notebook where this function is defined.
Is there a way to do efficient and systematic testing ?
Yes, using the versioned packages method I mentioned in combination with databricks-connect, you are totally able to use your IDE, implement tests, have proper git integration.
Git integration is very simple, but this is not my main concern.
Built-in git integration is actually very poor when working in bigger teams. You can't develop in the same notebook simultaneously, as there's a flat and linear accumulation of changes that are shared with your colleagues. Besides that, you have to link and unlink repositories that are prone to human error, causing your notebooks to be synchronized in the wrong folders, causing runs to break because notebooks can't be imported. I advise you to also use my packaging solution.
The packaging solution works as follows Reference:
List item
On your desktop, install pyspark
Download some anonymized data to work with
Develop your code with small bits of data, writing unit tests
When ready to test on big data, uninstall pyspark, install databricks-connect
When performance and integration is sufficient, push code to your remote repo
Create a build pipeline that runs automated tests, and builds the versioned package
Create a release pipeline that copies the versioned package to DBFS
In a "runner notebook" accept "process_date" and "data folder/filepath" as arguments, and import modules from your versioned package
Pass the arguments to your module to run your tested code

The way we are doing it -
-Integrate the Dev notebooks with Azure DevOps.
-Create custom Build and Deployment tasks for Notebook, Jobs, package and cluster deployments. This is sort of easy to do with the DatabBricks RestAPI
https://docs.databricks.com/dev-tools/api/latest/index.html
Create Release pipeline for Test, Staging and Production deployments.
-Deploy on Test and test.
-Deploy on Staging and test.
-Deploy on production
Hope this can help.

Related

How to restore NuGet package in Azure Pipeline?

I am new to Azure DevOps and trying to create my first Azure pipeline. I have a ASP.NET MVC project and there are a few NuGet packages that need to be restored before the MSBuild step.
Unfortunately, the NuGet restore is failing with the following error:
The pipeline is not valid. Job Job_1: Step 'NuGetCommand' references
task 'NuGetCommand' at version '2.194.0' contains an execution handler
that relies on NodeJS version '6' which is restricted by your
administrator.
NodeJS 6 came disabled out of the box so we are not going to enable it.
My Questions:
Is there an alternative to NuGet restore that does not use NodeJS?
Is there a way to update the NodeJS6 to a higher version?
update 23-Nov-2021
I have found a work around for the time being. I am using a custom PowerShell script to restore NuGet Packages and build Visual Studio project
$msBuildExe = 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\MSBuild\Current\Bin\MSBuild.exe'
Write-Host "Restoring NuGet packages" -foregroundcolor green
& "$($msBuildExe)" "$($path)" /p:Configuration=Release /p:platform=x86 /t:restore
Note: $path here is the path to my .csproj file
Apparently, other people are also getting the same issue and it is just a matter of time that the task is updated by the OpenSource community.
Here are some similar issues being faced in other tasks as well:
https://github.com/microsoft/azure-pipelines-tasks/issues/15526
https://github.com/microsoft/azure-pipelines-tasks/issues/15511
https://github.com/microsoft/azure-pipelines-tasks/issues/15516
https://github.com/microsoft/azure-pipelines-tasks/issues/15525

It's AzureDevOps' NuGetCommand task that uses NodeJS, not NuGet itself. Therefore, you can find a way to restore without using Azure DevOps' NuGetCommand task.
Idea 1: use DotnetCoreCli task instead. However, this probably won't work for you since you said your project is ASP.NET MVC, rather than ASP.NET Core. Also, it also appears to need NodeJS to run.
Idea 2: Use MSBuild restore. You can test on your local machine whether or not this works by clearing your global packages folder, or temporarily configuring NuGet to use a different path, and then running msbuild -t:restore My.sln from a Developer PowerShell For Visual Studio prompt. If your project uses packages.config, rather than PackageReference, you'll need to also pass -p:RestorePackagesConfig=true (although maybe this is currently broken). I'm not an expert on Azure Pipelines tasks, so I don't know what it means that this task defines both PowerShell and Node execution entry points, but maybe it means it will work even if your CI agent doesn't allow NodeJS.
Idea 3: Don't use any of the built-in tasks, just use - script: or - task: PowerShell#2, but even that is a little questionable whether it'll work since even the powershell task defines a Node execution entry point. I'm guessing it will work, but I don't have access to a CI agent where NodeJS is forbidden, so I couldn't test even if I wanted to. Anyway, if this works, then you can run MSBuild yourself (but it might also be your responsibility to find msbuild.exe if it's not on the path). Or you can download nuget.exe yourself and execute it in your script. The point is, if you can get Azure Pipeline's script task working, you can run any script and do everything you need yourself.
Idea 4: Use Microsoft Hosted agents. They have documented all the software they pre-install on the machines, which includes Node JS. Downside is that once you exceed the free quota it costs money, and I've worked for companies where it's easier to get money to buy hardware once-off, and pretend that maintenance of the server is free, even though it reduces team productivity, rather than pay for a monthly service. So, I'll totally understand if this is not an option for you.
Idea 5: Talk to whoever maintains your CI agents and convince them to allow & install NodeJS. It's clearly a fundamental part of Azure Pipelines. The tasks are open source on github, and you can see that pretty much all of them use NodeJS to orchestrate whatever work it does. Frankly, I thought the agent software itself was a NodeJS application, so I'm surprised that it runs without NodeJS.

How to write automated test for Gitlab CI/CD configuration?

I have notified the complexity of my configuration for Gitlab CI/CD grows fairly fast. Normally, in a either a programming language or infrastructure code I could write automated test for the code.
Is it possible to write automated test for the gitlab-ci.yml file itself?
Does it exist any library or testing framework for it?
Ideally I would like to setup the different environments variables, execute the gitlab-ci.yml file, and do assertions on the output.

I am not aware of any tool currently to really test behaviour. But i ended up with two approaches:
1. testing in smaller chunks aka unit testing my building blocks.
I extracted my builds into smaller chunks which i can include, and test seperately. This works fine for script blocks etc. but is not really sufficient for rule blocks - but offers great value, eg i use the templates for some steps as verification for the generated docker image, which will be used by those steps.
2. verification via gitlab-ci-local
https://github.com/firecow/gitlab-ci-local is a tool which allows you to test it locally, and which allows you to provide environment variables. There are some minor issues with branch name resolution in pr pipelines but besides that it works great. I use it to test gitlab ci files in GitHub Actions.
I hope that this helps somehow

Gitlab have a linter for gitlab-ci.yaml
To access the CI Lint tool, navigate to CI/CD > Pipelines or CI/CD > Jobs in your project and click CI lint.
Docs

Azure Data Factory V2 multiple environments like in SSIS

I'm coming from a long SSIS background, we're looking to use Azure data factory v2 but I'm struggling to find any (clear) way of working with multiple environments. In SSIS we would have project parameters tied to the Visual Studio project configuration (e.g. development/test/production etc...) and say there were 2 parameters for SourceServerName and DestinationServerName, these would point to different servers if we were in development or test.
From my initial playing around I can't see any way to do this in data factory. I've searched google of course, but any information I've found seems to be around CI/CD then talks about Git 'branches' and is difficult to follow.
I'm basically looking for a very simple explanation and example of how this would be achieved in Azure data factory v2 (if it is even possible).

It works differently. You create an instance of data factory per environment and your environments are effectively embedded in each instance.
So here's one simple approach:
Create three data factories: dev, test, prod
Create your linked services in the dev environment pointing at dev sources and targets
Create the same named linked services in test, but of course these point at your tst systems
Now when you "migrate" your pipelines from dev to test, they use the same logical name (just like a connection manager)
So you don't designate an environment at execution time or map variables or anything... everything in test just runs against test because that's the way the linked servers have been defined.
That's the first step.
The next step is to connect only the dev ADF instance to Git. If you're a newcomer to Git it can be daunting but it's just a version control system. You save your code to it and it remembers every change you made.
Once your pipeline code is in git, the theory is that you migrate code out of git into higher environments in an automated fashion.
If you go through the links provided in the other answer, you'll see how you set it up.
I do have an issue with this approach though - you have to look up all of your environment values in keystore, which to me is silly because why do we need to designate the test servers hostname everytime we deploy to test?
One last thing is that if you a pipeline that doesn't use a linked service (say a REST pipeline), I haven't found a way to make that environment aware. I ended up building logic around the current data factories name to dynamically change endpoints.
This is a bit of a bran dump but feel free to ask questions.

Although it's not recommended - yes, you can do it.
Take a look at Linked Service - in this case, I have a connection to Azure SQL Database:
You have possibilities to use dynamic content for either the server name and database name.
Just add a parameter to your pipeline, pass it to the Linked Service and use in the required field.
Let me know whether I explained it clearly enough?

Yes, it's possible although not so simple as it was in VS for SSIS.
1) First of all: there is no desktop application for developing ADF, only the browser.
Therefore developers should make the changes in their DEV environment and from many reasons, the best way to do it is a way of working with GIT repository connected.
2) Then, you need "only":
a) publish the changes (it creates/updates adf_publish branch in git)
b) With Azure DevOps deploy the code from adf_publish replacing required parameters for target environment.
I know that at the beginning it sounds horrible, but the sooner you set up an environment like this the more time you save while developing pipelines.
How to do these things step by step?
I describe all the steps in the following posts:
- Setting up Code Repository for Azure Data Factory v2
- Deployment of Azure Data Factory with Azure DevOps
I hope this helps.

Variable substitution in build pipeline

There are tons of resources online on how to replace JSON configuration files in a release pipeline like this one. I configured this. It works. However, we have multiple integration tests which reach the database too. These tests are run during build time. I haven't seen any option yet to replace config values in the build pipeline. Does it exist? Or do I really have to use this custom task (see screenshot below)?

There is an out-of-the-box task since recently by Microsoft. It's called File Transform. It's currently in preview but it works really well! Haven't had any issues whatsoever with it and it works the same as you would configure it in the release pipeline. Would recommend this any day!
Below you can see my configuration.

There is no out-of-the-box task only to replace tokens/values in files (also in the release pipline the task is Azure App Service Deploy and not only for replace json configuration).
You need to use an external extension from here or write a PowerShell script for that.

How to update repository with built project?

I’m trying to set up GitLab CI/CD for an old client-side project that makes use of Grunt (https://github.com/yeoman/generator-angular).
Up to now the deployment worked like this:
run ’$ grunt build’ locally which built the project and created files in a ‘dist’ folder in the root of the project
commit changes
changes pulled onto production server
After creating the .gitlab-ci.yml and making a commit, the GitLab CI/CD job passes but the files in the ‘dist’ folder in the repository are not updated. If I define an artifact, I will get the changed files in the download. However I would prefer the files in ‘dist’ folder in the to be updated so we can carry on with the same workflow which suits us. Is this achievable?

I don't think commiting into your repo inside a pipeline is a good idea. Version control wouldn't be as clear, some people have automatic pipeline trigger when their repo is pushed, that'd trigger a loop of pipelines.
Instead, you might reorganize your environment to use Docker, there are numerous reasons for using Docker in a professional and development environments. To name just a few: that'd enable you to save the freshly built project into a registry and reuse it whenever needed right with the version you require and with the desired /dist inside. So that you can easily run it in multiple places, scale it, manage it etc.
If you changed to Docker you wouldn't actually have to do a thing in order to have the dist persistent, just push the image to the registry after the build is done.
But to actually answer your question:
There is a feature request hanging for a very long time for the same problem you asked about: here. Currently there is no safe and professional way to do it as GitLab members state. Although you can push back changes as one of the GitLab members suggested (Kamil Trzciński):
git push http://gitlab.com/group/project.git HEAD:my-branch
Just put it in your script section inside gitlab-ci file.
There are more hack'y methods presented there, but be sure to acknowledge risks that come with them (pipelines are more error prone and if configured in a wrong way, they might for example publish some confidential information and trigger an infinite pipelines loop to name a few).
I hope you found this useful.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string